PREFACE


This volume contains the texts of the papers presented at the Ninth
International Conference on Computing Methods in Applied Sciences and
Engineering.

The proceedings of this Conference reflect the evolution of scientific
computing over recent years, namely:

1/ At the methodological level

Increased reliance on vector and parallel computers, which makes it
necessary to rethink numerical algorithms and the software used to
implement them on these new machines.

2/ At the application level

The solution of complex problems arising from numerical simulation in
hypersonics, microelectronics, quantum chemistry, and combustion, or
associated with the ongoing development of manned space stations.

3/ At the level of applied mathematics

The development of new methods such as wavelets, whose range of
application is constantly growing, thanks in particular to the interaction
between scientific disciplines fostered by events such as this Conference.

We hope that the participants will benefit from the papers presented at
this meeting and from the opportunity to communicate directly with the
speakers and other specialists attending the Conference. Naturally, we
hope that they will find new tools here that allow them to solve more
efficiently the difficult scientific problems encountered in their
professional activities.


The Organizers.


FOREWORD 


The  present  volume  contains  the  texts  of  the  oral 
communications  presented  at  the  Ninth  International 
Conference  on  Computing  Methods  in  Applied  Sciences  and 
Engineering. 

The proceedings of this Conference demonstrate the evolution of
scientific computing in recent years:

1/ At the methodological level

An increasing use of vector and parallel machines, implying the need to
redefine numerical algorithms and software for computation on these new
machines.

2/ At the application level

The solution of complicated problems in hypersonic flow,
microelectronics, quantum chemistry, combustion, space station
programs...

3/ At the applied mathematical level

The development of new methods, such as wavelets, whose field of
application is steadily growing, thanks, in particular, to the interaction
taking place at scientific meetings such as the present one.

We hope that the participants will benefit from the communications
presented at this Conference and will have the possibility of
communicating directly with the speakers and other specialists of the
field. We also hope that these scientists will find new tools allowing
them to solve more efficiently the difficult problems they encounter in
their professional life.


The Organizers


Abstract


A  review  of  the  direct  simulation  Monte  Carlo  (DSMC)  method  of  Bird  is 
presented.  The  DSMC  method  provides  the  capability  of  simulating  real  gas 
flows in the rarefied flow regime. Recent developments and applications of
the method to hypersonic flows are reported, both for ground-based tests and
for flight during entry. Results obtained using both axisymmetric and
three-dimensional  codes  are  included. 

Nomenclature 

$c$  velocity

$c_r$  magnitude of the relative velocity (speed) between two molecules

$C_D$  drag coefficient

$C_H$  heat-transfer coefficient, $2q/(\rho_\infty V_\infty^3)$

$C_i$  species mass fraction

$C_L$  lift coefficient

$C^*$  $\mu^* T_\infty/(\mu_\infty T^*)$

$f$  normalized velocity distribution function in velocity space

$F_N$  number of physical molecules represented by each simulated molecule

$Kn$  Knudsen number, $\lambda/l$

$K_r^2$  Cheng's parameter, $p_\infty R_N/(\mu_\infty V_\infty C^*)$

$l$  characteristic dimension

$L/D$  lift-to-drag ratio

$n$  number density

$\hat{n}$  nondimensional density rise in shock wave,
$(\rho - \rho_1)/(\rho_2 - \rho_1)$

$N$  number of simulated molecules in a computational cell

$p$  pressure

$q$  heat flux

$R_N$  nose radius

$t$  time

$\Delta t$  flow time step

$T$  temperature

$V_c$  volume of computational cell

$V_\infty$  freestream velocity

$x, y, z$  Cartesian coordinates

$\alpha$  angle of incidence

$\delta$  shock wave thickness

$\eta$  coordinate normal to body surface

$\lambda$  mean free path

$\mu$  viscosity

$\rho$  density

$\sigma$  total collision cross section


Abbreviations 


AFE  Aeroassist  Flight  Experiment 

ASTV  Aeroassisted  Space  Transfer  Vehicle 

DSMC  direct  simulation  Monte  Carlo 

FM  free  molecular 

NASP  National  Aero-Space  Plane 

NTC  no  time  counter 

TSS-2  Tethered  Satellite  System  -  2 

VHS  variable  hard  sphere 

VSL  viscous  shock-layer 


Introduction 


With the commitment of several nations to expand current space
transportation capabilities through the use of transatmospheric winged
vehicles and aeroassisted space transfer vehicles (ASTVs), attention is now
being seriously focused on the aerothermodynamics of vehicles at very high
altitudes. In this rarefied flow regime, the molecular mean free path in
the gas becomes significant when compared either with a characteristic
distance over which important flow-property changes take place or with the
size of the object creating the flow disturbance. Since the flow is
hypersonic, the flow disturbance that envelops the space vehicle will be a
nonequilibrium flow; that is, one in which nonequilibrium (Refs. 1 and 2)
exists among the various energy modes (translational and internal), the
chemistry, and, for the more energetic flows, radiation (Refs. 3-5).

For such flows, the shock-wave thickness is significant (Refs. 6-9); Fig. 1
shows the location in altitude-velocity space where the shock-wave
thickness $\delta_s$ is 10 percent of the shock standoff distance for a
10-cm nose radius. The flowfield disturbance created by the vehicle is also
very much larger than that experienced under continuum flow conditions. At
the surface of the vehicle, the gas-surface interactions are very important
(Refs. 10 and 11), as they significantly influence the aerodynamic forces
and heating that the vehicle experiences.

To describe such flows, one must acknowledge the discrete nature of the
flow. The continuum description, as expressed by the Navier-Stokes
equations, is limited in its ability to simulate rarefied flows, and the
Boltzmann equation, which does acknowledge the discrete nature of the flow,
is difficult to solve. Direct physical simulation methods have therefore
been developed over the last three decades for modeling rarefied effects.
These developments have generally been concerned with the direct simulation
Monte Carlo (DSMC) method (Refs. 12-17) and, to a lesser extent, with the
Molecular Dynamics method (Refs. 17 and 18). The DSMC method of Bird is the
most widely used method today for simulating rarefied flows in an
engineering context. The DSMC method takes advantage of the discrete
structure of the gas and provides a direct physical simulation as opposed to
a numerical solution of a set of model equations. This is accomplished by
developing phenomenological models of the relevant physical events.
Phenomenological models have been developed and implemented in the DSMC
procedure to account for translational, thermal, chemical, and radiative
nonequilibrium effects. The present discussion will review the general
features of the DSMC method, the numerical requirements for obtaining
meaningful results, the modeling used to simulate high-temperature gas
effects, and applications of the method to calculate the flow about various
configurations under hypersonic low-density conditions. Results obtained
using both axisymmetric and three-dimensional codes are included.


Rarefied Hypersonic Flows

A wide range of engineering studies associated with current and
projected space vehicles is concerned with the aerothermodynamics of
low-density gas flows. Flows of particular interest can arise from
interactions between two or more of the following: the vehicle itself, the
ambient atmosphere, exhaust plumes from upper stage motors or control
motors, and other emitted gases from material outgassing and waste gas
venting. Studies concerned with these interactions are receiving added
impetus from the Space Shuttle Orbiter flights, the commitment of several
nations to pursue the goal of transatmospheric flight with hypersonic
slender vehicles, space experiments, technology demonstration programs such
as the Aeroassist Flight Experiment (AFE) vehicle and the Tethered
Satellite System-2 (TSS-2), the projected space station, and aeroassisted
space transfer vehicles (ASTVs). (See Fig. 1.) On-orbit and high-altitude
flight applications occur under conditions where the effects of rarefaction
can be very significant in terms of the development of the flowfield
structure that envelops a vehicle or spacecraft and the momentum and energy
transport to its surface. The degree of rarefaction is conventionally
expressed through an overall Knudsen number defined by

$$Kn = \frac{\lambda_\infty}{l} \tag{1}$$


where $\lambda_\infty$ is the mean free path in the undisturbed gas and $l$
is some typical dimension of the flow. Bird (Ref. 1) has suggested that a
more precise quantification of rarefaction effects should be based on a
local Knudsen number defined as

$$Kn = \frac{\lambda}{a}\,\frac{\partial a}{\partial x} \tag{2}$$

where $\lambda$ is the local mean free path, $x$ is a linear dimension, and
$a$ is a macroscopic flow variable such as density, velocity, or
temperature. When the value of the local Knudsen number approaches 0.1, the
continuum formulation as modeled by the Navier-Stokes equations becomes
suspect, since the Chapman-Enskog expressions for the viscosity, heat
conduction, and diffusion coefficients are in error. In fact, the
Chapman-Enskog expressions become virtually unusable when the local Knudsen
number exceeds 0.2. The ranges of validity, in terms of local Knudsen
number, of the equations that describe a gas flow as a continuum or as a set
of discrete particles are shown in Fig. 2 (Ref. 1).
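
As a rough illustration of equation (2), the sketch below evaluates a local
Knudsen number on a one-dimensional profile and labels it using the 0.1 and
0.2 values quoted above. The function names, the finite-difference
gradient, the absolute value applied to the gradient, and the sample
profile are assumptions made for this example and are not taken from Ref. 1.

```python
import numpy as np

def local_knudsen(mean_free_path, a, x):
    """Local Knudsen number Kn = (lambda / a) * |da/dx| on a 1-D grid.

    The absolute value is an assumption of this sketch, so that Kn is
    non-negative for both rising and falling profiles.
    """
    dadx = np.gradient(a, x)  # finite-difference estimate of da/dx
    return mean_free_path * np.abs(dadx) / np.abs(a)

def continuum_status(kn):
    """Label a Knudsen number using the 0.1 and 0.2 values quoted in the text."""
    if kn < 0.1:
        return "continuum formulation acceptable"
    if kn <= 0.2:
        return "continuum formulation suspect"
    return "Chapman-Enskog expressions virtually unusable"

# Hypothetical linear density profile with a constant mean free path.
x = np.linspace(0.0, 1.0, 11)
density = 1.0 + x
kn = local_knudsen(0.3, density, x)
```

Because the gradient is taken of the flow variable itself, the same
function can be applied to density, velocity, or temperature profiles.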

Even though the Boltzmann equation is the classical formulation for
describing a gas as a set of individual particles, this equation has
remained intractable to analytical and conventional numerical solution for
space-related applications. Applications have been largely restricted to a
perfect monatomic gas where the flow is steady and one dimensional. The
restriction to relatively simple flows is primarily due to the computational
requirements of any numerical method that has to work in phase space. The
addition of chemical reactions would make the Boltzmann equation difficult
to formulate, let alone solve. Furthermore, almost all space-related
applications involve flows with at least two spatial dimensions and a
three-dimensional distribution function in velocity space. This leads to a
five-dimensional grid, and direct numerical solutions can hardly be
contemplated.
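
To give a feel for the size of the five-dimensional grid mentioned above,
the following back-of-the-envelope sketch counts grid points for two
spatial and three velocity dimensions. The resolution of 100 points per
axis and the storage of one double-precision value per point are arbitrary
assumptions chosen only for illustration.

```python
def phase_space_points(points_per_axis, spatial_dims, velocity_dims):
    """Number of grid points in a phase-space grid with equal resolution per axis."""
    return points_per_axis ** (spatial_dims + velocity_dims)

# Hypothetical resolution: 100 points along each of the five axes.
points = phase_space_points(100, 2, 3)
gigabytes = points * 8 / 1e9  # 8 bytes per double-precision value
```

Under these assumptions a single copy of the distribution function needs
ten billion values, before any time stepping or collision integral is
evaluated.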

Because of the limited prospects of direct numerical solutions of the
Boltzmann equation for practical space applications, an alternate approach
has been developed. The alternative to a formal numerical solution is to
take advantage of the discrete structure of the gas and conduct a direct
physical simulation of the flow using the computer. The DSMC method is such
an alternative, and it is the method most readily applied today to
hypersonic flow problems in the transitional flow regime, that is, flow
problems bounded by the continuum and the free molecular flow regimes.


DSMC  Method 

The DSMC method (Refs. 12 and 15) is a technique for the computer modeling
of a real gas by thousands of simulated molecules. The velocity components
and position coordinates of these molecules are stored in the computer and
are modified with time as the molecules are concurrently followed through
representative collisions and boundary interactions in simulated physical
space.

For a simple dilute gas, the assumptions used in implementing the DSMC
method are consistent with the assumptions underlying the nonlinear
Boltzmann equation

$$\frac{\partial (nf)}{\partial t} + \mathbf{c} \cdot \nabla (nf) = L(f,f) \tag{3}$$

where $f(\mathbf{c}, t, \mathbf{r})$ is the molecular distribution function,
$\mathbf{r}$ the position vector, $t$ the time, $\mathbf{c} = (u, v, w)$ the
particle velocity, and $n(\mathbf{r}, t)$ the macroscopic particle number
density. In equation (3) the left side is the particle movement or
convection operator, and $L(f,f)$ is the collision operator representing
changes in $f$ due to binary collisions between molecules. In DSMC no
actual use of equation (3) is made, and the phase space (i.e., $\mathbf{r}$,
$\mathbf{c}$) information is carried directly by an ensemble of a few
thousand sample molecules. These simulation particles or molecules exist
within a framework of cells and at any instant are assumed to represent a
sample from $f$ within the phase space being considered. Starting from some
initial configuration, DSMC computes the evolution of the sample ensemble
through a sequence of discretized time intervals $j\,\Delta t$ (where
$j = 1, 2, \ldots$). Under certain conditions on $\Delta t$, the two sides
of equation (3) may be decoupled, and each may be simulated alternately in
the time sequence.

During the convection simulation, each particle moves in a free trajectory
in $\Delta t$ and interacts with any boundary encountered according to
prescribed boundary strategies. For the simulation of the right side of
equation (3), a statistical interpretation of $L(f,f)$ leads to a simple
algorithm for the calculation of sample collisions during $\Delta t$ in each
of a set of spatial cells which represent the spatial component of phase
space. Statistical estimates of the macroscopic fluid properties or the
surface properties (pressure, etc.) represent the "solution" to the flow
problem and are obtained by averaging the contributions of individual
particles as they pass through cells or strike the body surface during the
calculation.
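
The alternation of convection and collision steps described above can be
sketched as follows. This is a minimal illustration and not Bird's code: it
assumes a one-dimensional slab with specularly reflecting walls, and the
collision routine is a user-supplied placeholder (here `no_collisions`,
which does nothing). All function and variable names are hypothetical.

```python
import numpy as np

def move(x, v, dt, length):
    """Convection step: free flight with specular reflection at x = 0 and x = length.

    Assumes |v_x| * dt < length, so a molecule meets at most one wall per step.
    """
    x = x + v[:, 0] * dt
    v = v.copy()
    low, high = x < 0.0, x > length
    x[low] = -x[low]
    x[high] = 2.0 * length - x[high]
    v[low | high, 0] *= -1.0
    return x, v

def no_collisions(v_cell, rng):
    """Placeholder collision routine: returns the cell velocities unchanged."""
    return v_cell

def dsmc_step(x, v, dt, length, n_cells, rng, collide=no_collisions):
    """One time step: move, index into cells, collide cell by cell, sample."""
    x, v = move(x, v, dt, length)
    cell = np.minimum((x / length * n_cells).astype(int), n_cells - 1)
    for k in range(n_cells):
        members = np.flatnonzero(cell == k)
        v[members] = collide(v[members], rng)
    counts = np.bincount(cell, minlength=n_cells)  # sampled molecules per cell
    return x, v, counts

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
v = rng.normal(size=(200, 3))
energy_before = np.sum(v**2)
for _ in range(50):
    x, v, counts = dsmc_step(x, v, 0.01, 1.0, 10, rng)
```

Averaging `counts` over many steps would give the kind of statistical
estimate of a macroscopic property (here, the number of simulated molecules
per cell) that the text refers to as the "solution".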

The assumptions of a dilute gas (mean spacing between molecules is much
larger than the mean particle diameter) and molecular chaos are common to
both the DSMC method and the Boltzmann equation. The consistency between
the two was demonstrated at an early stage of development (Ref. 19) through
the derivation of the Boltzmann equation from the DSMC method, and the
relationship between the two has since been investigated in some detail by
Nanbu (Ref. 20). The objective of the simulation should be to obtain a
valid physical model of the real gas flow, and while this is generally
equivalent to a solution of the Boltzmann equation, the DSMC procedures
often go beyond the limitations of the Boltzmann formulation. For example,
simulations have included some dense gas effects (Ref. 21) such as ternary
collisions. Moreover, the method is routinely applied to problems involving
chemical reactions (Refs. 1-9) and, to a lesser extent, problems with
thermal radiation (Refs. 3-5). These effects are also beyond the current
formulations of the Boltzmann equation.

Computational  Aspects 


The principal computational tasks associated with the DSMC method are
movement of the molecules, indexing molecules into cells, molecular
collisions, and sampling of flowfield and surface quantities. (See Fig. 3.)
The molecules used in the analyses are simulated molecules, each of which
represents a very large number (on the order of $10^{19}$) of physical
molecules. Thus, scaling the density by a very large factor has the effect
of substantially reducing the number of molecular trajectories and molecular
collisions that must be calculated. Note, however, that the physical
velocities, molecular size (diameter or cross section), and internal
energies are preserved in the simulation.

Another major aspect of the DSMC method is the uncoupling of the
molecular motion and molecular collisions. The validity of this dichotomy
is assured by requiring that the computational time step be small when
compared with the real physical collision time
[$\Delta t < (n \sigma c_r)^{-1}$]. In addition to the time discretization,
a cell structure is required for two purposes: first, the selection of
potential collision pairs and second, the sampling of flow properties. The
simulation becomes more exact as the cell size and time step tend to
approach zero.
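
The time-step requirement above amounts to a one-line check. The sketch
below uses hypothetical SI values for the number density, cross section,
and relative speed; none of these numbers come from the paper.

```python
def mean_collision_time(number_density, cross_section, relative_speed):
    """Physical collision time 1 / (n * sigma * c_r) for one molecule."""
    return 1.0 / (number_density * cross_section * relative_speed)

def time_step_ok(dt, number_density, cross_section, relative_speed):
    """True when dt is below the collision time, as the criterion requires."""
    return dt < mean_collision_time(number_density, cross_section, relative_speed)

# Hypothetical values: n = 1e20 m^-3, sigma = 1e-19 m^2, c_r = 1000 m/s.
tau = mean_collision_time(1.0e20, 1.0e-19, 1.0e3)
```

In practice one would choose the time step as a small fraction of `tau`
rather than a value just below it; the fraction is a modeling choice that
the text does not specify.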

The cell dimensions must be small in comparison with the scale length
of the macroscopic flow gradients. The simulated molecules in the cell are
then regarded as representative of the real molecules at the location of the
cell, and the relative location of the molecules within the cell is
disregarded in the selection of collision partners. It is well established
(Ref. 2) that the cell size must be small in comparison with the local mean
free path in regions of large gradients. For problems with large density
variations, the use of variable cell sizes assists in resolving the flow
gradients and also in minimizing the computational requirements, provided
that the flow is steady. Since the flow is always calculated as an unsteady
flow starting from some specified initial state (usually a uniform
freestream or vacuum), any steady flow becomes the large-time state of the
unsteady flow. For boundary conditions where the flow is steady, the
overall computational effort can be substantially reduced by subdividing the
flowfield into an arbitrary number of units (regions) where the time step
$\Delta t$ and the scaling factor $F_N$ (the number of physical molecules
represented by each simulated molecule) remain constant within a region but
can vary from region to region. Of course, such simulations are not
time-consistent solutions, but they provide steady-state solutions with a
substantial reduction in computational requirements. The combination of
subdividing the flowfield into regions along with the use of variable cell
sizes provides the flexibility to substantially reduce the total number of
molecules used in the simulation and also resolve the flow gradients.
Recall that in the DSMC method of Bird, the procedures are specified such
that the computational time is linearly dependent on the number of
molecules.

There is some lower limit on the number of simulated molecules per cell
because the cell-sampled density is used in the procedures for establishing
the collision rate. It is desirable to have the number of molecules per
cell on the order of ten. The other function of the cell besides sampling
is the selection of collision pairs. During this process, it is desirable
to reduce the mean separation distance of the collision pairs (pairs are
selected without regard to position within the cell) and thereby minimize
the smearing of gradients. Bird (Ref. 15) has recently addressed this
requirement by introducing the option of subdividing the sampling cell into
an arbitrary number of sub-cells for the selection of collision pairs. The
sub-cells are chosen to contain on average two or three molecules, so that
all collisions approach the "nearest-neighbor" ideal. Should there be only
one molecule in a sub-cell, the potential collision partner is selected from
an adjacent sub-cell within the cell. With this procedure, the molecular
sampling is still done on a cell basis, while the collision pairs are
selected within the sub-cells. The computing time penalty associated with
these additional procedures is negligible.
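
The sub-cell selection rule can be sketched as below. This is an
illustration under stated assumptions, not the procedure of Ref. 15: the
sub-cells are taken to be numbered along one direction, and when a molecule
is alone in its sub-cell the partner is drawn from the nearest occupied
sub-cell, which generalizes the "adjacent sub-cell" rule in the text. The
function name and array layout are hypothetical.

```python
import numpy as np

def pick_partner(i, subcell, rng):
    """Choose a collision partner for molecule i among the molecules of one cell.

    subcell[k] is the integer sub-cell index of molecule k (1-D layout assumed).
    Partners in the same sub-cell are preferred; otherwise the nearest
    occupied sub-cell is used.
    """
    others = np.flatnonzero(np.arange(subcell.size) != i)
    if others.size == 0:
        return None  # molecule i is alone in the cell
    distance = np.abs(subcell[others] - subcell[i])
    nearest = others[distance == distance.min()]
    return int(rng.choice(nearest))

rng = np.random.default_rng(1)
subcell = np.array([0, 0, 1, 3])
partner_of_0 = pick_partner(0, subcell, rng)  # shares sub-cell 0 with molecule 1
partner_of_3 = pick_partner(3, subcell, rng)  # alone in sub-cell 3
```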

Much of the comment on the DSMC method has centered on the way in which
a representative set of collisions is selected for each cell at each
discrete flow time step $\Delta t$ so that the appropriate collision
frequency is maintained. Prior to 1988, the collision sampling technique
that had been recommended by Bird (Ref. 12) was the "time-counter" method,
where advantage is taken of the fact that the computational time is linearly
proportional to the number of molecules. The collision pairs are accepted
with probability proportional to the product of the magnitude of the
relative velocity $c_r$ and the total collision cross section $\sigma$.
[This is accomplished by normalizing the $\sigma c_r$ product by the maximum
value that has ever occurred within the particular cell and then using an
acceptance-rejection procedure (see Appendix D, Ref. 12) to accept or reject
the collision pair that was selected at random.] For each collision pair
selected, a "cell time" is advanced by

$$\frac{2}{N \langle n \rangle \sigma c_r} \tag{4}$$

for a simple gas (see page 121 of Ref. 12 for the corresponding expression
for inverse power law molecules), where $N$ is the number of simulated
molecules in the cell, and $\langle n \rangle$ is the time-averaged (steady
flow) or ensemble-averaged (unsteady flow) number density in the physical
flow. Sufficient collisions are calculated to keep the cell collision time
concurrent with the flow time.
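
A minimal sketch of the time-counter procedure for one cell is given
below. It assumes equal-mass hard spheres with a constant cross section
(so that the cross section cancels in the acceptance ratio) and isotropic
scattering; these choices, the function names, and the sample values are
assumptions made for illustration and are not taken from Ref. 12. The
sketch also assumes the molecules in the cell do not all share the same
velocity.

```python
import numpy as np

def scatter_hard_sphere(v1, v2, rng):
    """Equal-mass elastic collision with an isotropic post-collision direction."""
    v_cm = 0.5 * (v1 + v2)
    c_r = np.linalg.norm(v1 - v2)
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = np.sqrt(1.0 - cos_t**2)
    phi = 2.0 * np.pi * rng.random()
    g = c_r * np.array([cos_t, sin_t * np.cos(phi), sin_t * np.sin(phi)])
    return v_cm + 0.5 * g, v_cm - 0.5 * g

def time_counter_step(v, t_cell, t_flow, n_avg, sigma, cr_max, rng):
    """Collide pairs in one cell until the cell time catches up with the flow time."""
    n_sim = len(v)
    n_coll = 0
    while t_cell < t_flow:
        i, j = rng.choice(n_sim, size=2, replace=False)
        c_r = np.linalg.norm(v[i] - v[j])
        cr_max = max(cr_max, c_r)  # largest value seen so far in this cell
        if rng.random() < c_r / cr_max:  # acceptance-rejection on sigma * c_r
            v[i], v[j] = scatter_hard_sphere(v[i], v[j], rng)
            t_cell += 2.0 / (n_sim * n_avg * sigma * c_r)  # equation (4)
            n_coll += 1
    return t_cell, cr_max, n_coll

rng = np.random.default_rng(2)
v = rng.normal(scale=300.0, size=(20, 3))  # hypothetical velocities, m/s
momentum_before = v.sum(axis=0)
energy_before = np.sum(v**2)
t_cell, cr_max, n_coll = time_counter_step(v, 0.0, 1.0e-4, 1.0e20, 1.0e-19, 1.0e3, rng)
```

The time increment in the loop is inversely proportional to `c_r`, so an
accepted pair with a small relative speed produces a large jump in the cell
time.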

In Ref. 17, Bird introduces a replacement for the time-counter method.
The new method is called the "no time counter" or NTC method, and he
strongly recommends it for all applications. A problem with the
time-counter method is that the acceptance of an unlikely collision (one
with a very small value of $\sigma c_r$) can advance the cell time by an
interval that is much larger than the flow time step $\Delta t$, and the
overall collision rate can be distorted. This is particularly true when the
number of simulated molecules $N$ in a cell is small.

The NTC method is obtained by modifying the "direct" or Kac method. In
the direct method, all possible pairs in each cell are considered, and the
probability of collision within the time step is equal to the ratio of the
volume swept out by the cross section (moving with the relative velocity) to
the cell volume $V_c$. The disadvantage is that the computation time is
very nearly proportional to the square of the number of molecules. The
direct method can be modified by reducing the number of sampled pairs by
some factor and increasing the collision probabilities by the same factor.
If the factor is such that the maximum collision probability of any pair is
unity, the number of pairs to be sampled is

$$\tfrac{1}{2}\, N \langle N \rangle F_N\, (\sigma c_r)_{\max}\, \Delta t / V_c \tag{5}$$

and the collision probability for each selection is

$$\frac{\sigma c_r}{(\sigma c_r)_{\max}} \tag{6}$$

The selection criterion of equation (6) is identical to that used in the
time-counter method. The only change with this method is that the number of
collision pair selections is given deterministically [equation (5)] rather
than probabilistically through the operation of the time counter. This
removes the undesirable correlation (Ref. 17) between unlikely collisions
and collision time intervals that can occur when the cell time counter is
advanced well beyond the current flow time.
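
Equations (5) and (6) translate into the short sketch below. It assumes
equal-mass hard spheres with a constant cross section and isotropic
scattering, and it rounds the fractional number of selections at random;
both are assumptions of this example rather than details given in the text
or in Ref. 17. Here `n_sim_avg` stands for $\langle N \rangle$, taken to be
the averaged number of simulated molecules in the cell, and all names and
sample values are hypothetical.

```python
import numpy as np

def scatter_hard_sphere(v1, v2, rng):
    """Equal-mass elastic collision with an isotropic post-collision direction."""
    v_cm = 0.5 * (v1 + v2)
    c_r = np.linalg.norm(v1 - v2)
    cos_t = 2.0 * rng.random() - 1.0
    sin_t = np.sqrt(1.0 - cos_t**2)
    phi = 2.0 * np.pi * rng.random()
    g = c_r * np.array([cos_t, sin_t * np.cos(phi), sin_t * np.sin(phi)])
    return v_cm + 0.5 * g, v_cm - 0.5 * g

def ntc_selections(n_sim, n_sim_avg, f_n, sigma_cr_max, dt, cell_volume):
    """Number of pair selections in one cell and time step, equation (5)."""
    return 0.5 * n_sim * n_sim_avg * f_n * sigma_cr_max * dt / cell_volume

def ntc_step(v, n_sim_avg, f_n, sigma, cr_max, dt, cell_volume, rng):
    """Select the prescribed number of pairs and accept each per equation (6)."""
    n_select = ntc_selections(len(v), n_sim_avg, f_n, sigma * cr_max, dt, cell_volume)
    n_pairs = int(n_select + rng.random())  # random rounding (assumption)
    n_coll, new_cr_max = 0, cr_max
    for _ in range(n_pairs):
        i, j = rng.choice(len(v), size=2, replace=False)
        c_r = np.linalg.norm(v[i] - v[j])
        new_cr_max = max(new_cr_max, c_r)  # kept for use in later time steps
        if rng.random() < c_r / cr_max:  # sigma cancels for hard spheres
            v[i], v[j] = scatter_hard_sphere(v[i], v[j], rng)
            n_coll += 1
    return n_coll, new_cr_max

rng = np.random.default_rng(3)
v = rng.normal(scale=300.0, size=(100, 3))  # hypothetical velocities, m/s
momentum_before = v.sum(axis=0)
energy_before = np.sum(v**2)
n_select = ntc_selections(100, 100.0, 1.0e15, 1.0e-19 * 1.0e3, 1.0e-6, 1.0e-4)
n_coll, new_cr_max = ntc_step(v, 100.0, 1.0e15, 1.0e-19, 1.0e3, 1.0e-6, 1.0e-4, rng)
```

Unlike a time counter, the amount of work per step is fixed before any pair
is examined, so a single unlikely collision cannot change how many pairs
are sampled.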

Molecular  Models 


During the development and extension of the DSMC method, there has been
a remarkable increase in the gas complexity for which numerical simulations
are possible. The modeling has advanced from a simple hard sphere model to
models that include inelastic effects such as rotation, vibration, chemical
reactions, electronic excitation, and radiation. The routines used to
compute the molecular interactions may be exercised millions of times during
the course of a simulation, and it is essential for them to be brief. In
developing a model and its numerical algorithm, a careful balance has to be
struck between the realism of the physical representation and the
computational efficiency.

For monatomic gases, the variable hard sphere (VHS) model is
recommended (Refs. 14 and 15) for engineering calculations. This model was
selected based on the accumulated experience that the effects of molecular
models can be correlated with the variation of the differential cross
section of the molecules, which is a function of the relative velocity in
the collision. The VHS model is essentially a hard sphere with a diameter
that varies as some inverse power of the relative velocity in the collision.
This is the simplest model that is capable of modeling the viscosity
coefficient of real molecules. A discussion of the VHS model and its
overall applicability relative to other interaction models (both inverse
power models, to which the VHS model belongs, and models that incorporate
long-range attractive forces, i.e., the Lennard-Jones (6-12), Buckingham
exp-6, and Morse models) is given in Ref. 22.
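
The inverse-power dependence of the VHS diameter on the relative velocity
can be written out as a short sketch. The reference diameter `d_ref`, the
reference speed `c_ref`, the exponent `nu`, and the numerical values used
are hypothetical names and numbers introduced here for illustration; the
text does not give specific values.

```python
import math

def vhs_diameter(c_r, d_ref, c_ref, nu):
    """Diameter d = d_ref * (c_ref / c_r)**nu; nu = 0 recovers a fixed hard sphere."""
    return d_ref * (c_ref / c_r) ** nu

def vhs_cross_section(c_r, d_ref, c_ref, nu):
    """Total collision cross section pi * d**2 of the equivalent hard sphere."""
    return math.pi * vhs_diameter(c_r, d_ref, c_ref, nu) ** 2

# Hypothetical parameters: 4e-10 m at 1000 m/s, exponent 0.25.
d_slow = vhs_diameter(1000.0, 4.0e-10, 1000.0, 0.25)
d_fast = vhs_diameter(2000.0, 4.0e-10, 1000.0, 0.25)
```

The faster pair sees a smaller sphere, and therefore a smaller cross
section, which is the only change relative to a fixed-diameter hard sphere.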

Reference 15 gives a brief summary of the modeling currently
implemented to describe internal degrees of freedom (rotation and
vibration), chemical reactions, electronic excitation, and radiation.
Reference 3 outlines in detail the modeling that is currently implemented
for describing the effects of partial ionization, electronic excitation, and
thermal radiation, effects that become important for very high entry
velocities such as those experienced by ASTVs in the transitional flow
regime.

Surface  Interactions 


As discussed by Harvey (Ref. 22), for most applications the distribution
of molecules reflected from engineering surfaces appears to correspond
closely to the diffuse and fully thermally accommodated pattern. Evidence
contrary to this has been reported (Ref. 23) based on observations from
upper atmospheric flight measurements. These observations are now in
question based on the results of recent DSMC simulations (Refs. 24-26). The
DSMC simulations show that transitional effects persist at higher altitudes
than had been assumed in the interpretation of the flight measurements. As
a consequence, it appears that the aerodynamic characteristics can be
explained in terms of a diffuse interaction with full thermal accommodation
when transitional effects are included, rather than by resorting to a
combination of diffuse and specular interactions while assuming free
molecular flow.

The effects of deviation from the diffuse model with full thermal
accommodation have been studied with DSMC calculations by applying the
Maxwell boundary condition, a linear combination of fully accommodated
diffuse and specular reflections, which is physically unrealistic.
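
A Maxwell-type boundary condition can be sketched for a single molecule as
follows. The sketch assumes a flat wall whose normal points along +z into
the gas, and it samples the diffusely reflected normal velocity from the
flux-weighted (Rayleigh) distribution at the wall temperature. The function
name, the argument list, and the molecular mass used in the example are
assumptions made for illustration.

```python
import numpy as np

K_B = 1.380649e-23  # Boltzmann constant, J/K

def maxwell_reflect(v_in, diffuse_fraction, wall_temp, mass, rng):
    """Reflect one molecule from a wall whose normal is +z (pointing into the gas).

    With probability diffuse_fraction the molecule is re-emitted diffusely with
    full accommodation to the wall temperature; otherwise it reflects specularly.
    """
    if rng.random() < diffuse_fraction:
        s = np.sqrt(K_B * wall_temp / mass)  # thermal speed scale at the wall
        vx, vy = rng.normal(scale=s, size=2)  # tangential components
        vz = s * np.sqrt(-2.0 * np.log(1.0 - rng.random()))  # flux-weighted normal
        return np.array([vx, vy, vz])
    return np.array([v_in[0], v_in[1], -v_in[2]])  # specular: flip the normal part

rng = np.random.default_rng(4)
v_in = np.array([100.0, -50.0, -300.0])  # hypothetical incident velocity, m/s
v_specular = maxwell_reflect(v_in, 0.0, 300.0, 6.63e-26, rng)
v_diffuse = maxwell_reflect(v_in, 1.0, 300.0, 6.63e-26, rng)
```

Setting `diffuse_fraction` to 1 gives the fully accommodated diffuse model
discussed above, and 0 gives purely specular reflection.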

Numerous attempts (Ref. 22) have been made to find more satisfactory ways
of predicting the gas-surface interaction, ranging from simple empirical
models to complex quantum lattice models. In general, none of these models
performs well, and most are too complicated for inclusion in DSMC
calculations. Other than the Maxwell model, the only alternative that has
been tested in a simulation is a modification of the Nocilla drifting
Maxwellian model (Ref. 11).

Comparisons  with  Experiments 

When the DSMC calculations are performed carefully (with particular
attention given to the numerical requirements of cell size and time step
and to the interaction modelling of the viscosity coefficient of the real
molecules), the method appears to yield results that agree very closely
with experiments.27-33 For example, Harvey and associates27-28 have made
numerous comparisons between experiments and DSMC results, primarily for
hypersonic nitrogen flows where internal-translational energy exchange is a
feature of importance. Results of comparisons for surface forces, heat
transfer, and flowfield profiles (density and rotational temperature) have
been reported for flow about sharp as well as blunt configurations.

In Ref. 30, Fiscko and Chapman calculated the one-dimensional
shock-wave structure for argon and showed that the calculated shock density
thickness was in good agreement with the measured values of Alsmeyer.34 A
more recent study31-33 also investigated the shock-wave structure of argon
(Mach 7) and helium (Mach 1.59, 20, and 25) flows using the DSMC method. In
these studies, the comparisons between calculation and experiment are done
at a very fundamental level in that the comparisons are made for the
molecular-velocity distribution function in the shock wave. The
measurements (Fig. 4) of the molecular velocities inside a hypersonic normal
shock wave, where the gas experiences rapid changes in macroscopic
properties, show a highly nonequilibrium molecular motion (translational
nonequilibrium) and a bimodal velocity distribution in the direction
parallel to the flow. For the experimental conditions, the DSMC method of
Bird provided accurate quantitative prediction of the molecular motion.

These calculations and comparisons with measurements represent an important,
detailed test of the DSMC method for elastic interatomic collisions.

Figure 4 presents a comparison of the calculated and experimental
velocity distributions at three locations (n denotes the fraction of the
density rise across the shock wave) within the normal shock wave. Shown are
the velocity distributions both parallel and normal to the flow for Mach 25
helium. Since the flow was produced in a free-jet expansion, the freestream
was not in equilibrium (the temperature perpendicular to the flow was about
1.1 K, while that parallel to the flow was 2.2 K), and this fact was
accounted for in the simulation. The calculated results shown in Fig. 4
used the Maitland-Smith intermolecular potential35 with a distance parameter
of 2.976 Å and a well depth of 10.9 K. Similar results using the VHS model
for Mach 20 helium are presented in Ref. 31.
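The bimodal character of the parallel velocity distribution can be illustrated without running a DSMC code. The sketch below is not the DSMC method of the cited studies: it swaps in the classical Mott-Smith ansatz, which models the gas inside the shock as a mixture of the upstream and downstream Maxwellians weighted by the density-rise fraction n. The upstream temperature of 2.2 K (used for both the sound speed and the parallel thermal spread), the helium constants, and all function names are assumptions made for illustration only.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
M_HE = 6.6464731e-27    # mass of a helium atom, kg
GAMMA = 5.0 / 3.0       # specific-heat ratio of a monatomic gas


def maxwellian_1d(u, u_mean, temperature):
    """Normalized 1-D Maxwellian for one velocity component (units of s/m)."""
    var = K_B * temperature / M_HE
    return math.exp(-(u - u_mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mott_smith_parallel(u, n, mach=25.0, t_up=2.2):
    """Mott-Smith ansatz at density-rise fraction n (0 upstream, 1 downstream).

    Returns the parallel-velocity distribution divided by the upstream number
    density. t_up is an assumed upstream temperature in K.
    """
    m2 = mach * mach
    # Rankine-Hugoniot density and temperature ratios across a normal shock.
    rho_ratio = (GAMMA + 1.0) * m2 / ((GAMMA - 1.0) * m2 + 2.0)
    t_ratio = ((2.0 * GAMMA * m2 - (GAMMA - 1.0)) * ((GAMMA - 1.0) * m2 + 2.0)
               / ((GAMMA + 1.0) ** 2 * m2))
    u_up = mach * math.sqrt(GAMMA * K_B * t_up / M_HE)
    u_down = u_up / rho_ratio
    cold = (1.0 - n) * maxwellian_1d(u, u_up, t_up)
    hot = n * rho_ratio * maxwellian_1d(u, u_down, t_ratio * t_up)
    return cold + hot


def count_peaks(n, u_min=-3000.0, u_max=4000.0, step=10.0):
    """Count strict local maxima of the distribution on a uniform grid (m/s)."""
    npts = int((u_max - u_min) / step) + 1
    f = [mott_smith_parallel(u_min + i * step, n) for i in range(npts)]
    return sum(1 for i in range(1, npts - 1) if f[i - 1] < f[i] > f[i + 1])
```

Under these assumptions, `count_peaks(0.5)` finds two maxima (a narrow peak of cold freestream molecules and a broad peak of hot, slow molecules), while the pure upstream and downstream states each show one. The measured and DSMC distributions in Fig. 4 are not constrained to this two-Maxwellian form.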

The previously cited comparisons are examples of recent efforts to
validate the DSMC technique by experiment. There have also been several
studies at hypersonic flight conditions where qualitative validation is
attempted by comparing with either limited flight measurements (Refs. 2, 25,
and 36) or with continuum methods (Refs. 2, 7, 8, and 37-41) for moderate
to small Knudsen number flows.

Application  of  DSMC  to  Hypersonic  Flows 

This section will briefly summarize the results of five application
studies that have been performed at the NASA Langley Research Center using
the DSMC method. Four of the studies are concerned with flight applications
(blunt slender bodies, Shuttle Orbiter, TSS-2, and AFE) that correspond to
the conditions shown in Fig. 1, whereas the fifth has been conducted for
future comparisons with measurements performed in a hypersonic wind tunnel.

Blunt  Slender  Body  Calculations 

Flows about cylindrically blunted wedges (2-D) and spherically blunted
cones with body half angles of 0°, 5°, and 10° were calculated7-8 with the
DSMC method for entry conditions (V∞ = 7.5 km/s, altitude range of 110 to
70 km). For a nose radius of 2.54 cm, the transitional flow effects persist
below 70 km and are important in defining the heating to the leading edges
of slender vehicles such as NASP. This is demonstrated by comparing
the DSMC calculations with continuum calculations using a viscous
shock-layer (VSL) method. Both the DSMC and VSL calculations include a
five-species reacting air gas model, a constant wall temperature of 1,000 K,
and a finite catalytic wall. Results from this study are presented in
Figs. 5-7. Figures 5 and 6 show the effect of the body configuration (2-D
versus axisymmetric) on flowfield composition and stagnation-point heating
rate. The extent of the flowfield disturbance is greater for the 2-D flow
than for the axisymmetric case. This affects the amount of dissociation in
the flow, as is clearly demonstrated in Fig. 5, where the maximum atomic mass
fractions are shown for both configurations as a function of altitude.

The stagnation-point heat-transfer coefficient as a function of Cheng's
parameter (p∞RN/μ∞U∞C*) is presented in Fig. 6 for both the 2-D and
axisymmetric bodies. Qualitatively, the results are what one would expect:
an increase from the small value near the continuum regime to a value of
unity as the free-molecule limit is approached. For the range of freestream
conditions considered, the 2-D heat-transfer coefficient is always lower
than the corresponding axisymmetric value.

The continuum VSL calculations were made for the 5° cone using the
no-slip boundary conditions. Calculations at altitudes of 50, 60, 70, and
80 km were made, with the overlap between the VSL and DSMC calculations being
70 and 80 km. Figure 6 shows the extent of the agreement between the two
methods for stagnation-point heat transfer. The continuum results begin to
depart significantly from the DSMC data above 70 km, and the same is true
for drag, which is not shown. If slip boundary conditions had been used in
the VSL calculations, better agreement at the more rarefied conditions would
have occurred.

Even though the computed stagnation-point heat-transfer rates differ
by only 15 percent at 70 km, there are significant differences between the
predicted flowfield structures along the stagnation streamline, particularly
downstream of the stagnation region. Figure 7 presents a comparison of the
density profiles along the stagnation streamline. With respect to the VSL
data, the DSMC calculations show that the upstream influence of the body is
more than three times that predicted by the VSL calculation. The
shock-layer thickness calculated with the VSL method is only about two
freestream mean free paths (λ∞ = 9.0 × 10⁻⁴ m). The DSMC
results are qualitatively what one would expect, since a freestanding normal
shock wave is about five mean free paths in thickness. (Note that the data
points shown for the DSMC calculation are only a partial set, particularly
near the wall.) The differences shown in the density adjacent to the
surface can be explained in part by the temperature jump (537 K) calculated
with the DSMC method and some differences in gas composition adjacent to the
wall.

Aerodynamics of Shuttle Orbiter

Accurate predictions of aerothermal loads during entry can be very
important for the design and development of hypersonic space vehicles. A
portion of the reentry for these vehicles takes place in the transitional
flow regime, where the various nonequilibrium effects become important
in establishing the thermal and aerodynamic response of these vehicles. In
order to simplify the computational requirements, the aerothermal loads for
vehicles such as the Space Shuttle Orbiter are often approximated42-43 with
a flat plate at incidence for the free-molecular flow regime. For the
transitional flow regime, empirical approximations are normally used to
calculate these loads.42-44 It has been observed from the Space Shuttle's
flight experiments that measured values of lift-drag ratio are considerably
higher than the free-molecular flow calculations at altitudes of 160 km and
above, where the flow regime was previously believed to be free-molecular.
This discrepancy was thought to be due to specular reflection of some
fraction of the molecules at the surface. As early as 1985, it was
recognized45 that transitional effects rather than specular gas-surface
interaction might be influencing the interpretation of the flight
measurements; however, no calculations were available to establish this
hypothesis.

Recent direct simulation Monte Carlo (DSMC) calculations24-25 of the
rarefied flow past flat plates at incidence were the first to show that the
transitional effects persist for the Space Shuttle Orbiter even at altitudes
(160 km and above) where the flow had previously been considered as
free-molecular. For the calculations of Ref. 25, two 12-m flat plates at
40° incidence were used to simulate the freestream Knudsen number of the
Space Shuttle Orbiter during entry. One plate had zero thickness, and the
second had a thickness of 0.5 m and a blunted leading edge (nose radius =
0.5 m). Both plates were 12 m in length, which corresponds to the mean
aerodynamic chord of the Shuttle Orbiter's wings. DSMC calculations were
made for an altitude range of 200 to 100 km at 7.5 km/s using a five-species
reacting air model.

The surface temperature is assumed to be constant along the surface and
equal to the wall radiative equilibrium value on the windward side
(evaluated with free-molecular heating and a surface emittance of 0.09).
Also, the wall is assumed to be diffuse with full thermal accommodation and
to promote recombination of the oxygen and nitrogen atoms.

Figure 8 presents the calculated lift and drag coefficients for the
flat plate as a function of freestream Knudsen number along with the
corresponding free-molecular results. These results show the expected
variation in the transitional flow regime. The drag coefficient increases
and the lift coefficient decreases substantially with increasing
rarefaction; both approach the free-molecular limit.

Figure 9 presents the lift-drag ratio as a function of freestream
Knudsen number for both plates. Even at a Knudsen number of 16 (altitude =
200 km), the lift-to-drag ratio has not attained the free-molecular value.
Results such as these have important implications for the interpretation of
flight measurements used to deduce aerodynamic coefficients under rarefied
conditions. At altitudes of 160 km and above, the conventional
procedure42-44 has been to interpret the flight measurements using the
free-molecular-flow calculations. Such procedures have been used to
establish what fraction of the gas-surface interaction is specular. As
the fraction of specular reflection increases, the lift-drag ratio also
increases for a given incidence angle. Since these two separate effects
both produce increased lift-drag ratio, interpretation of flight
measurements must account for the transitional effects.
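The dependence of the free-molecular lift-drag ratio on the specular fraction can be sketched with a simple momentum balance. The example below is an idealization and not the free-molecular calculation used in the cited references: it assumes the hypersonic limit (thermal motion of the freestream neglected), a cold wall (diffusely re-emitted molecules carry negligible momentum), and a plate with only its windward face exposed. The function name and the sampled specular fractions are illustrative.

```python
import math


def lift_to_drag(alpha_deg, specular_fraction):
    """Hypersonic, cold-wall free-molecular L/D of a flat plate at incidence.

    Coefficients are referenced to the freestream dynamic pressure and the
    plate area. A fraction `specular_fraction` of the molecules reflects
    specularly; the remainder is fully accommodated (diffuse, cold wall).
    """
    a = math.radians(alpha_deg)
    eps = specular_fraction
    # Normal momentum: absorbed once (diffuse) or reversed (specular).
    c_p = 2.0 * (1.0 + eps) * math.sin(a) ** 2
    # Tangential momentum: absorbed (diffuse) or retained by the gas (specular).
    c_tau = 2.0 * (1.0 - eps) * math.sin(a) * math.cos(a)
    lift = c_p * math.cos(a) - c_tau * math.sin(a)
    drag = c_p * math.sin(a) + c_tau * math.cos(a)
    return lift / drag


# L/D at the 40-degree incidence used for the flat-plate study.
ratios = [lift_to_drag(40.0, eps) for eps in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

In this idealization the ratio reduces to ε sin 2α / (1 - ε cos 2α), which rises monotonically with the specular fraction ε. A higher-than-expected measured lift-drag ratio can therefore be attributed either to specular reflection or to transitional effects, which is why the two must be separated when interpreting flight data.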

Tethered Satellite Flowfield Characterization

The Tethered Satellite System-2 (TSS-2) is being proposed as a
cooperative effort of the National Aeronautics and Space Administration of
the United States and the Agenzia Spaziale Italiana of Italy. For this
mission, the Shuttle Orbiter would be used to demonstrate a tethered
satellite system in a downward deployment and retrieval of a 500-kg,
1.6-m-diameter spacecraft attached to the end of a 100-km tether. The tethered
spacecraft could reach downward into the outer atmosphere of the Earth to
altitudes of 130 km for TSS-2 and later perhaps to 90 km. One of the
objectives of the TSS-2 mission is to conduct hypersonic research in the
transitional flow regime.

Initial calculations using the DSMC method have been made by Wilmoth46
for a 1.6-m-diameter sphere. Figures 10 and 11 show selected contours of
nondimensional density and temperature, respectively. These results are for
a 130-km altitude and a freestream velocity of 7.5 km/s. The gas is a
five-species reacting air model, and the wall temperature is constant at 350 K.

At 130 km, the freestream Knudsen number based on spacecraft diameter is
4.8. The flow is in the transitional regime, since the drag coefficient is
only 95 percent of the free-molecular value of 2.1. In addition, there is
negligible dissociation occurring at this altitude; however, there is clear
evidence of thermal diffusion (not shown) where the concentration of atomic
oxygen (from the freestream) decreases near the surface and the
concentration of the more massive oxygen and nitrogen molecules increases.
As discussed in Ref. 47, thermal diffusion acts to concentrate the heavy gas
in the cooler regions of the flow (adjacent to the surface). Calculations
such as these are useful in defining the range of flow parameters for which
measurements could be made. Furthermore, the potential flight measurements
would be useful in validating the computational tools.

Transitional  Flow  about  the  AFE 

A side view of the AFE vehicle is shown in Fig. 12. The aerobrake is
an elliptically blunted elliptic cone raked off at the base and fitted with
a skirt-type afterbody. The three-dimensional configuration has a base
length of 4.25 m.

Figure 13 shows the computational grid used to simulate the 3-D flow
for the 120-km-altitude case. In this figure, both cells and subcells are
shown on the outer freestream boundary. (For the present three-dimensional
application, the cells are deformed hexahedra, and each cell is further
divided into tetrahedral subcells.) However, on the plane of symmetry, only
the cell structure is drawn for clarity. Only the forebody and the
experimental carrier are included in the calculation, since the solid rocket
motor is ejected during entry near 130 km.

Reference 26 describes in some detail the highly nonequilibrium flow
that surrounds the AFE vehicle at these high-altitude conditions (100 to
200 km) and the resulting surface pressure and heat transfer distributions.
The results of this study show that dissociation is important at 110-km
altitude and below (a five-species gas model was used) and that the flow
approaches the free-molecular limit very gradually at higher altitudes.

Even at 200 km, the flow is not completely collisionless. This is clearly
evident in Fig. 14, where the lift-to-drag ratio is presented at selected
altitudes for an angle of incidence of 0° (using the present coordinate
system shown in Fig. 12). Figure 14 also shows the calculated free-molecule
and modified Newtonian results, along with experimental wind-tunnel data.

The experiments were conducted in the NASA Langley Research Center Mach 10
air and Mach 6 CF4 (freon) wind tunnels using high-fidelity models.48
Clearly, the DSMC results approach the free-molecule limit very slowly at
higher altitudes, and even at an altitude of 200 km, the flow is not
completely collisionless. Prior to this study, it was generally
acknowledged that free-molecule flow existed for the AFE vehicle for
altitudes near 150 km, but this study shows that the transitional
effects are significant at these altitudes and influence the overall
aerodynamic coefficients.

Figure 14 also contains the results of the Lockheed bridging formula,
which empirically connects the axial and normal aerodynamic force
coefficients between the continuum and free-molecule limits. This is
accomplished with a sine-square function by assuming continuum flow at a
Knudsen number Kn∞ = 0.01 and free-molecule flow at Kn∞ = 10, which
correspond to altitudes of 90 and 150 km, respectively. The bridging
formula results are plotted to show the general trend even though they are
erroneous for the conditions considered in the present study.
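A sine-square bridging function of this kind can be sketched as follows. The text gives only the functional family and the two endpoint Knudsen numbers; the sketch assumes that the argument of the sine varies linearly with log10(Kn) between those endpoints, which may differ from the exact Lockheed formulation. The function name and the coefficient values in the usage note are hypothetical.

```python
import math


def bridge(kn, c_cont, c_fm, kn_cont=0.01, kn_fm=10.0):
    """Sine-square bridging of a force coefficient between two limits.

    c_cont and c_fm are the continuum and free-molecule values of the
    coefficient; kn_cont and kn_fm are the Knudsen numbers at which those
    limits are assumed to be reached.
    """
    if kn <= kn_cont:
        return c_cont
    if kn >= kn_fm:
        return c_fm
    # Fractional position between the limits on a logarithmic Knudsen scale.
    x = (math.log10(kn) - math.log10(kn_cont)) / (
        math.log10(kn_fm) - math.log10(kn_cont))
    return c_cont + (c_fm - c_cont) * math.sin(0.5 * math.pi * x) ** 2
```

For example, with placeholder limits `c_cont=1.0` and `c_fm=2.0`, the bridged value is halfway between them at Kn = 10^-0.5. Because the function saturates at Kn = 10, it cannot reproduce the slow approach to the free-molecule limit that the DSMC results show at higher Knudsen numbers.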

Rarefied  Flow  about  a  Delta  Wing 

In the study of Ref. 49, a general three-dimensional (3-D) DSMC code is
used to simulate a rarefied flow about a delta wing (Fig. 15). As shown in
this figure, the top of the wing is flat, the bottom is V-shaped, and the
edges are rounded with a constant radius, R = 0.0013 m. The shape of the
nose from the side view is elliptical, although it appears sharp from the top
view. The origin of the coordinate system is located at the tip of the
nose, and the x axis is parallel to the top surface and is normal to the
base plane. The top surface is inclined 30° away from the freestream. The
vertical midplane (at z = 0) is a plane of symmetry; thus computations have
been made for half of the geometry. Figure 16(a) shows a perspective view
of the 3-D computational grid used in this study, and Fig. 16(b) shows the
grid structure at the aft end. The body-fitted grid has a total of 5,280
cells.


The flow simulated in this study is a wind-tunnel experiment in which
the flowfield freestream conditions are T∞ = 13.32 K, V∞ = 1503 m/s,
M∞ = 20.2, and ρ∞ = 1.729 × 10⁻⁵ kg/m³. According to the VHS collision
model (with Tref = 300 K, dref = 4.07 × 10⁻¹⁰ m, and the temperature
exponent of the viscosity coefficient of 0.75), the calculated freestream
mean free path and viscosity are 0.00159 m and 1.9 × 10⁻⁶ N·s/m²,
respectively. Hence, the overall Knudsen number (based on the body length)
is 0.0159, and the freestream Reynolds number per meter is 14,000. The body
surface is specified to be at a uniform temperature of 620 K. Full thermal
accommodation and diffuse reflection are assumed for the gas-surface
interaction. Simulations are performed using a nonreacting gas model with
one chemical species (N2) while considering energy exchange between
translational and internal (rotational and vibrational) modes.
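The quoted freestream quantities can be cross-checked with a few lines of arithmetic. The sketch below uses only the values stated above, plus two assumptions that are not in the text: a specific-heat ratio of 1.4 and a specific gas constant of 296.8 J/(kg·K) for N2. The variable names are illustrative.

```python
import math

# Freestream values quoted in the text.
T_INF = 13.32        # temperature, K
V_INF = 1503.0       # velocity, m/s
RHO_INF = 1.729e-5   # density, kg/m^3
MFP_INF = 0.00159    # VHS mean free path, m
MU_INF = 1.9e-6      # VHS viscosity, N*s/m^2
KN_OVERALL = 0.0159  # overall Knudsen number based on body length

# Assumed properties of nitrogen (not stated in the text).
GAMMA_N2 = 1.4       # specific-heat ratio
R_N2 = 296.8         # specific gas constant, J/(kg K)

sound_speed = math.sqrt(GAMMA_N2 * R_N2 * T_INF)   # m/s
mach = V_INF / sound_speed                         # compare with M = 20.2
reynolds_per_meter = RHO_INF * V_INF / MU_INF      # compare with 14,000
body_length = MFP_INF / KN_OVERALL                 # length implied by Kn, m
```

With these inputs the Mach number comes out close to the quoted 20.2, the Reynolds number per meter is about 13,700 (quoted as 14,000), and the Knudsen number implies a body length of 0.1 m.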

The computations were performed for a total of 9,000 time steps. A
stationary state was reached after about 1,000 time steps, and thereafter
flowfield samples were taken every other time step. Hence, the time-averaged
flowfield results presented in this study are based on 4,000 samplings.
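The sample count follows directly from the run schedule, as the short sketch below shows. The function name is illustrative, and the final line rests on an assumption that is only approximately true for samples taken two time steps apart: that successive samples are statistically independent, so that scatter falls as the inverse square root of the sample count.

```python
def sample_count(total_steps, transient_steps, sample_interval):
    """Number of flowfield samples taken after the start-up transient."""
    return (total_steps - transient_steps) // sample_interval


# Schedule quoted in the text: 9,000 steps, 1,000 discarded, every other step.
samples = sample_count(9000, 1000, 2)   # 4000

# Rough relative scatter of a time-averaged quantity, per simulated molecule
# in a cell, if the samples were fully independent.
relative_scatter = samples ** -0.5
```

Under that independence assumption, halving the scatter would require roughly four times as many samples, which is one reason DSMC run times grow quickly with the accuracy demanded.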

Figure 17 presents the density flowfield contours along the symmetry
plane [17(a)] and at a cross-sectional plane located at the 80-percent chord
location [17(b)], and the same contour levels are shown in both parts. The
flowfield results are obtained at the centroids of the cells and hence do
not extend to the boundaries (one half cell from the boundaries).
Accordingly, the flowfield contours for the symmetry plane are actually one
half-cell from the symmetry plane. (See Ref. 49 for similar information
regarding flowfield contours of Mach number, various temperatures, and
surface quantities.)

This study represents one of the first applications of the 3-D DSMC
method to a flow about a relatively sharp-nosed body. The computations
indicate that the leeside flow is attached, and the results will be compared
with the leeside density profiles and total body measurements (Allegre55 and
his co-workers at CNRS in France) when available. Further computations
related to downstream effects are needed to have a better understanding of
the overall flowfield structure.

Future  Research  Activities 


The previous examples of application of the DSMC method to hypersonic
external flow problems represent only a very limited portion of the wide
spectrum of current applications. Examples of other applications are
spacecraft contamination51-52 resulting from the expansion of gases out of a
rocket nozzle, the study and characterization of pumping devices,53-54 and
studies in materials processing55 concerning thin-film vapor deposition.
Because of the central importance of codes in predicting rarefied flows,
future activities will focus on developing codes that are computationally
more efficient and easier to use, improving on the existing physical
modeling, and performing experiments that can be used to validate existing
modeling or provide the data base essential for new models such as that
needed for gas-surface interactions and energy exchange mechanisms.

Efforts by a number of researchers are currently being made to
implement means of faster execution while minimizing storage
requirements. A major problem of the DSMC method has been the large amounts
of computing time required for relatively simple problems. More attention
has recently been focused on ways of reducing the computing time, including
development of new algorithms that take advantage of current supercomputer
architectures.55-57 An approach that is currently being pursued by a number
of researchers is the application of parallel processing.55-59 Reductions
in computing time of up to twelvefold using 16 processors55 and up to
sixteenfold using 32 processors59 have been reported.
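The reported reductions can be restated as parallel efficiencies, i.e. speedup divided by processor count. The sketch below uses only the two figures quoted above; the function and variable names are illustrative, and the two results come from different studies, so they are not a scaling curve for a single code.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


# Reported reductions in computing time: processors -> speedup.
reported = {16: 12.0, 32: 16.0}
efficiencies = {p: parallel_efficiency(s, p) for p, s in reported.items()}
# efficiencies == {16: 0.75, 32: 0.5}
```

An efficiency of 75 percent on 16 processors and 50 percent on 32 suggests, with due caution about comparing different codes and machines, that communication and load-balancing overheads were already significant at these processor counts.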

Most of the computational time required by existing DSMC codes is taken
up by the analytical geometry associated with the description of complex
flows. A new flow definition system55 is under development that cuts this
time by a factor of twenty. Consequently, the potential for major reduction
in the execution time of the DSMC method is real, particularly with the
synergism associated with improved algorithms, vectorization, and parallel
processing.

In the area of physical model development, Refs. 61 and 62 are examples
of recent studies concerned with the modeling of various physical processes
(energy exchange between the translational and internal modes in Ref. 61 and
the way in which electric field effects are modeled in Ref. 62). The needs
for experimental work in rarefied gas dynamics as it relates to DSMC
validation are reviewed by Muntz in Ref. 16.

Concluding  Remarks 

A review of the DSMC method with regard to what it is, the capability
that it provides for analyzing transitional flows, applications that focus
on hypersonic external flows, and current and future research activities has
been presented. Major accomplishments have occurred in the past five years
in developing the capability to predict highly nonequilibrium reacting
flowfields along with the effects of ionization and radiation. Considerable
efforts are now being focused on improving its computational efficiency and
on code validation. The preliminary results of both activities are
extremely encouraging, as indicated in the present review. With more
efficient codes, the range of applications and problem complexity will
increase. Consequently, it is important that experiments be performed that
provide information from which various aspects of the DSMC method can be
validated.




References 


1Bird, G. A., "Low-Density Aerodynamics," Progress in Astronautics and
Aeronautics: Thermophysical Aspects of Re-entry Flows, edited by J. N. Moss
and C. D. Scott, Vol. 103, 1986, pp. 3-24.

2Moss, J. N. and Bird, G. A., "Direct Simulation of Transitional Flow
for Hypersonic Reentry Conditions," Progress in Astronautics and
Aeronautics: Thermal Design of Aeroassisted Orbital Transfer Vehicles,
edited by H. F. Nelson, Vol. 96, 1985, pp. 113-139.

3Bird, G. A., "Nonequilibrium Radiation During Re-entry at 10 km/s,"
AIAA Paper 87-1543, June 1987.

4Moss, J. N., Bird, G. A., and Dogra, V. K., "Nonequilibrium Thermal
Radiation for an Aeroassist Flight Experiment Vehicle," AIAA Paper 88-0081,
January 1988.

5Moss, J. N. and Price, J. M., "Direct Simulation of AFE Forebody and
Wake Flow with Thermal Radiation," Progress in Astronautics and Aeronautics:
Rarefied Gas Dynamics: Theoretical and Computational Techniques, edited by
E. P. Muntz, D. P. Weaver, and D. H. Campbell, Vol. 118, 1989, pp. 413-431.

6Chapman, D. R., Fiscko, K. A., and Lumpkin, F. E., "A Fundamental
Problem in Computing Radiating Flow Fields with Thick Shock Waves," SPIE
Proceedings on Sensing, Discrimination, and Signal Processing and
Superconducting Materials and Instrumentation, Vol. 879, 1988, pp. 106-112.

7Cuda, V. and Moss, J. N., "Direct Simulation of Hypersonic Flows Over
Blunt Wedges," AIAA Journal of Thermophysics and Heat Transfer, Vol. 1, April
1987, pp. 97-104.

8Moss, J. N. and Cuda, V., "Nonequilibrium Effects for Hypersonic
Transitional Flows," AIAA Paper 87-0404, January 1987.

9Bird, G. A., "Direct Simulation of Typical AOTV Entry Flows," AIAA
Paper 86-1310, June 1986.

10Allegre, J., Raffin, M., and Gottesdiener, L., "Slip Effects on
Supersonic Flowfields Around NACA 0012 Airfoils," Proceedings of the 15th
International Symposium on Rarefied Gas Dynamics, edited by V. Boffi and C.
Cercignani, Vol. 1, 1986, pp. 548-557.

11Hurlbut, F. C., "Sensitivity of Hypersonic Flow Over a Flat Plate to
Wall/Gas Interaction Models Using DSMC," AIAA Paper 87-1545.

12Bird, G. A., Molecular Gas Dynamics, Clarendon Press, Oxford, 1976.

13Bird, G. A., "Monte Carlo Simulation of Gas Flows," Annual Reviews of
Fluid Mechanics, Vol. 10, edited by M. D. Van Dyke, J. V. Wehausen, and
J. L. Lumley, Annual Reviews Inc., Palo Alto, CA, 1979, p. 11

14Bird, G. A., "Monte Carlo Simulation in an Engineering Context,"
AIAA Progress in Astronautics and Aeronautics: Rarefied Gas Dynamics, Vol.
74, Part 1, edited by S. S. Fisher, AIAA, New York, 1981, pp.




15Bird, G. A., "Direct Simulation of Gas Flows at the Molecular Level,"
Communications in Applied Numerical Methods, Vol. 4, 1988, pp. 165-172.

16Muntz, E. P., "Rarefied Gas Dynamics," Annual Reviews of Fluid
Mechanics, Vol. 21, edited by J. L. Lumley, M. D. Van Dyke, and H. L. Reed,
1989, pp. 387-417.

17Bird, G. A., "Perception of Numerical Methods in Rarefied
Gasdynamics," Progress in Astronautics and Aeronautics: Rarefied Gas
Dynamics: Theoretical and Computational Techniques, edited by E. P. Muntz,
D. P. Weaver, and D. H. Campbell, Vol. 118, 1989, pp. 221-226.

18Alder, B. J., and Wainwright, T. E., "Molecular Dynamics by Electronic
Computer," Transport Processes in Statistical Mechanics, edited by
I. Prigogine, Interscience, New York, 1958, pp. 97-131.

19Bird, G. A., "Direct Simulation and the Boltzmann Equation," The
Physics of Fluids, Vol. 13, No. 11, Nov. 1970, pp. 2676-2681.

20Nanbu, K., "Theoretical Basis of the Direct Simulation Monte Carlo
Method," Proceedings of the 15th International Symposium on Rarefied Gas
Dynamics, edited by V. Boffi and C. Cercignani, Vol. 1, 1986, pp. 369-383.

21Bird, G. A., "Direct Molecular Simulation of a Dissociating Diatomic
Gas," Journal of Computational Physics, Vol. 25, Dec. 1977, pp. 353-365.

22Harvey, J. K., "Inelastic Collision Models for Monte Carlo Simulation
Computation," Progress in Astronautics and Aeronautics: Rarefied Gas
Dynamics: Theoretical and Computational Techniques, edited by E. P. Muntz,
D. P. Weaver, and D. H. Campbell, Vol. 117, 1989, pp. 3-24.

23Blanchard, R. C. and Larman, K. T., "Rarefied Aerodynamics and Upper
Atmospheric Flight Results from the Orbiter High Resolution Accelerometer
Package Experiment," AIAA Paper 87-2366, 1987.

24Dogra, V. K., Moss, J. N., and Price, J. M., "Rarefied Flow Past a
Flat Plate at Incidence," Progress in Astronautics and Aeronautics:
Rarefied Gas Dynamics: Theoretical and Computational Techniques, edited by
E. P. Muntz, D. P. Weaver, and D. H. Campbell, Vol. 118, 1989, pp. 567-581.

25Dogra, V. K. and Moss, J. N., "Hypersonic Rarefied Flow About Plates at
Incidence," AIAA Paper 89-1712, June 1989.

26Celenligil, M. C., Moss, J. N., and Blanchard, R. C.,
"Three-Dimensional Flow Simulation about the AFE Vehicle in the Transitional
Regime," AIAA Paper 89-0245, January 1989.

27Harvey, J. K., "Direct Simulation Monte Carlo Method and Comparison
with Experiment," Progress in Astronautics and Aeronautics: Thermophysical
Aspects of Re-Entry Flows, edited by J. N. Moss and C. D. Scott, Vol. 103,
1986, pp. 25-43.



28Harvey, J. K., Celenligil, M. C., Dominy, R. G., and Gilmore, M. R.,
"A Flat-Ended Circular Cylinder in Hypersonic Rarefied Flow," AIAA Paper
89-1709, June 1989.

29Taylor, J. C., Moss, J. N., and Hassan, H. A., "Study of Hypersonic
Flow Past Sharp Cones," AIAA Paper 89-1713, June 1989.

30Fiscko, K. A., and Chapman, D. R., "Comparison of Burnett,
Super-Burnett, and Monte Carlo Solutions for Hypersonic Shock Structure,"
Progress in Astronautics and Aeronautics: Rarefied Gas Dynamics:
Theoretical and Computational Techniques, edited by E. P. Muntz, D. P.
Weaver, and D. H. Campbell, Vol. 117, 1989, pp. 374-395.

31Pham-Van-Diep, G. C. and Erwin, D. A., "Validation of MCDS by
Comparison of Predicted with Experimental Velocity Distribution Functions in
Rarefied Normal Shocks," Progress in Astronautics and Aeronautics: Rarefied
Gas Dynamics: Theoretical and Computational Techniques, edited by E. P.
Muntz, D. P. Weaver, and D. H. Campbell, Vol. 118, 1989, pp. 271-283.

32Erwin, D. A., Muntz, E. P., and Pham-Van-Diep, G., "A Review of
Detailed Comparisons Between Experiments and DSMC Calculations in
Nonequilibrium Flows," AIAA Paper 89-1883, June 1989.

33Pham-Van-Diep, G., Erwin, D., and Muntz, E. P., "Nonequilibrium
Molecular Motion in a Hypersonic Shock Wave," Science, Vol. 245, August
1989, pp. 624-626.

34Alsmeyer, H., "Density Profiles in Argon and Nitrogen Shock Waves
Measured by the Absorption of an Electron Beam," Journal of Fluid Mechanics,
Vol. 74, 1976, pp. 497-513.

35Maitland, G. C., Rigby, M., Smith, E. B., and Wakeham, W.,
"Intermolecular Forces," Clarendon Press, Oxford, 1981.

36Bird, G. A., "Computation of Electron Density in High Altitude
Re-Entry Flows," AIAA Paper 89-1882, June 1989.

37Cheng, H. K., Lee, C. J., Wong, E. Y., and Yang, H. T., "Hypersonic
Slip Flows and Issues on Extending Continuum Model Beyond the Navier-Stokes
Level," AIAA Paper 89-1663, June 1989.

38Gokcen, T. and MacCormack, R., "Nonequilibrium Effects for Hypersonic
Transitional Flows Using Continuum Approach," AIAA Paper 89-0461, Jan. 1989.

39Chrusciel, G. T. and Pool, L. A., "Knudsen-Layer Properties for a
Conical Afterbody in Rarefied Hypersonic Flows," Progress in Astronautics
and Aeronautics: Rarefied Gas Dynamics: Theoretical and Computational
Techniques, edited by E. P. Muntz, D. P. Weaver, and D. H. Campbell, Vol.
118, 1989, pp. 464-475.

40Gupta,  R.  N.  and  Simmonds,  A.  L.,  "Hypersonic  Low-Density  Solutions  of 
the  Navier-Stokes  Equations  with  Chemical  Nonequilibrium  and  Multicomponent 
Surface  Slip,"  AIAA  Paper  86-1349,  June  1986. 




41Gnoffo,  P.  A.,  Gupta,  R.  N.,  and  Shinn,  J.  L.,  "Conservation  Equations 
and  Physical  Models  for  Hypersonic  Air  Flows  in  Thermal  and  Chemical 
Nonequilibrium,"  NASA  TP-2867,  February  1989. 

42Blanchard, R. C., "Rarefied Flow Lift-to-Drag Measurements of the
Shuttle Orbiter," 15th Congress of International Council of Aeronautical
Sciences, Paper ICAS-86-2.10.1, London, England, September 7-12, 1986.

43Blanchard, R. C., Hendrix, M. K., Fox, J. C., Thomas, D. J., and
Nicholson, J. Y., "Orbital Acceleration Research Experiment," Journal of
Spacecraft and Rockets, Vol. 24, No. 6, November-December, 1987.

44Aerodynamic Design Data Book, Volume 1, Orbiter Vehicle 102,
SD72-SH-0060, Volume IM, November 1980, Space Division, Rockwell
International.

45Blanchard, R. C. and Rutherford, J. F., "Shuttle Orbiter High
Resolution Accelerometer Package Experiment: Preliminary Flight Results,"
Journal of Spacecraft and Rockets, Vol. 22, No. 4, July-August 1985,
pp. 474-480.

46Wood, G. M., Wilmoth, R. G., Carlomagno, G. M., and deLuca, L.,
"Proposed Aerothermodynamic Experiments in Transition Flow Using the
NASA/ASI Tethered Satellite System-2," AIAA Paper 90-0536, January 1990.

47Bird, G. A., "Thermal and Pressure Diffusion Effects in High Altitude
Flows," AIAA Paper 88-2732, June 1988.

48Wells, W. L., "Wind Tunnel Preflight Test Program for Aeroassist
Flight Experiment," AIAA Paper 87-2367-CP, August 1987.

49Celenligil, M. C. and Moss, J. N., "Direct Simulation of Hypersonic
Rarefied Flow About a Delta Wing," AIAA Paper 90-0143, January 1990.

50Allegre, J., Private Communications, Laboratoire d'Aerothermique de
CNRS, 4 ter Route des Gardes, F92190 Meudon, France.

51Hueser, J. E., Melfi, L. T., Bird, G. A., and Brock, F. J., "Rocket
Nozzle Lip Flow by Direct Simulation Monte Carlo," AIAA Journal of
Spacecraft and Rockets, Vol. 23, July-August 1986, pp. 363-367.

52Bird, G. A., "Influence of Local Configuration on the Back Flow from
Small Thruster Rockets," AIAA Paper 90-0147, January 1990.

53Usami, M., Fujimoto, T., and Kato, S., "Monte Carlo Simulation on Mass
Flow Reduction due to Roughness of a Slit Surface," Progress in Astronautics
and Aeronautics: Rarefied Gas Dynamics: Space-Related Studies, edited by
E. P. Muntz, D. P. Weaver, and D. H. Campbell, Vol. 116, 1989, pp. 283-297.

54Nanbu, K., Watanabe, Y., Igarashi, S., Dettleff, G., and
Koppenwallner, G., "Effectiveness of a Parallel Plate Arrangement as a
Cryogenic Pumping Device," Progress in Astronautics and Aeronautics: Rarefied
Gas Dynamics: Space-Related Studies, edited by E. P. Muntz, D. P. Weaver,
and D. H. Campbell, Vol. 116, 1989, pp. 283-297.

55Watanabe, Y., Nanbu, K., and Igarashi, S., "Angular Distributions of
Molecular Flux Effusing from a Cylindrical Crucible Partially Filled with
Liquid," Progress in Astronautics and Aeronautics: Rarefied Gas Dynamics:
Space-Related Studies, edited by E. P. Muntz, D. P. Weaver, and D. H.
Campbell, Vol. 116, 1989, pp. 283-297.

56McDonald, J. D. and Baganoff, D., "Vectorization of a Particle
Simulation Method for Hypersonic Rarefied Flow," AIAA Paper 88-2735, June
1988.

57Feiereisen, W. J. and McDonald, J. D., "Three-Dimensional Discrete
Particle Simulation of an AOTV," AIAA Paper 89-1711, June 1989.

58Wilmoth, R. G., "Direct Simulation Monte Carlo Analysis on Parallel
Processors," AIAA Paper 89-1666, June 1989.

59Furlani, T. R. and Lordi, J. A., "A Comparison of Parallel Algorithms
for the Direct Simulation Monte Carlo Method II: Applications to Exhaust
Plume Flowfields," AIAA Paper 89-1167, June 1989.

60Bird, G. A., Private communications, Department of Aeronautical
Engineering, J07, University of Sydney, N.S.W., 2006, Australia.

61Boyd,  I.  D.,  "Direct  Simulation  of  Rotational  and  Nonequilibrium," 
AIAA  Paper  89-1880,  June  1989. 

62Carlson,  A.  B.  and  Hassan,  H.  A.,  "Direct  Simulation  of  Reentry  Flows 
with  Ionization,"  AIAA  Paper  90-0144,  January  1990. 


Fig. 1 Hypersonic flow environment. [Plot: altitude (km) versus velocity
$V_\infty$ (km/s), showing the Shuttle entry, TSS-2, AFE, ASTV and NASP
flight corridor regimes.]


(a)  n  -  0.065. 

Fig. 4 Comparison of molecular velocity distributions (Ref. 33).

Fig. 5 Calculated maximum atomic mass fractions along
stagnation streamline ($V_\infty$ = 7.5 km/s and $R_N$ = 2.54 cm).


Fig. 8 Calculated lift and drag coefficients
for a flat plate ($V_\infty$ = 7.5 km/s, $\alpha$ = 40°).


Fig. 6 Stagnation heat-transfer coefficient versus
Cheng's rarefaction parameter.


Fig. 9 Effect of rarefaction on aerodynamic
characteristics ($V_\infty$ = 7.5 km/s, $\alpha$ = 40°).


Fig. 7 Calculated stagnation streamline density
(Alt = 70 km, $V_\infty$ = 7.5 km/s, $R_N$ = 2.54 cm, axisymmetric).

RECENT PROGRESS
IN REACTIVE FLOW COMPUTATIONS

B. LARROUTUROU

CERMICS, INRIA, Sophia-Antipolis, 06560 VALBONNE, FRANCE


1.  INTRODUCTION 

We report in this paper on some recent progress concerning a family of
approximation schemes for the numerical simulation of a multi-dimensional flow
of a reactive perfect or real gaseous mixture.

All these schemes basically employ a second-order accurate upwind approximation,
using slope limiters in order to give an oscillation-free solution. Moreover, they operate
on any (and in particular on a possibly unstructured) finite-element mesh. Thus, these
schemes generalize to multi-component flows the finite-element upwind schemes
developed for a single gas by A. Dervieux and L. Fezoui [10], [15], in the spirit of the MUSCL
(Monotonic Upwind Schemes for Conservation Laws) methodology of Van Leer [40].

The scheme improvements discussed below relate in particular to the coupling
of the mass fraction equations with the basic hydrodynamic equations. Specifically,
we derive a family of schemes which have the property of preserving the maximum
principle (and in particular the positivity) for the mass fractions of all species in the
gaseous mixture.

The way this coupling between the mass fraction equations and the gas dynamics
equations is actually taken into account in the discrete approximation is based on
the study of a generalized Riemann problem for one-dimensional multi-component gas
dynamics; this Riemann problem is discussed in Section 2. Then Section 3 is devoted
to the discussion of the basic multi-component approximation schemes in one space
dimension, while the extension to multi-dimensional flows is described in Section 4.

These schemes have proved to provide robust and accurate solutions of many
reactive flows in various regimes, ranging from highly subsonic (Mach number of the
order of $10^{-3}$) to hypersonic (Mach number of the order of 20) reactive flows, in simple
or complex geometries, and in two or three space dimensions. As an illustration we
present and discuss several numerical examples in Section 5.

2. THE MULTI-COMPONENT EULER EQUATIONS


2.1.  The  equations 


As a first step, we consider the one-dimensional inviscid flow of a non-reactive
mixture of $N$ species $\Sigma_1, \Sigma_2, \ldots, \Sigma_N$. The governing equations for this flow express the
conservation of momentum and of total energy and the conservation of mass for each
component. They take the form (see e.g. [43]):

$$
\begin{cases}
(\rho u)_t + (\rho u^2 + p)_x = 0 , \\
\mathcal{E}_t + [u(\mathcal{E} + p)]_x = 0 , \\
(\rho Y_k)_t + (\rho u Y_k)_x = 0 \quad \text{for } 1 \le k \le N ,
\end{cases}
\qquad (2.1)
$$

where $\rho$ is the mixture density, $u$ is the mixture velocity (which is also the velocity of
each species, since we neglect here molecular diffusion), $p$ is the total pressure in the
mixture, $\mathcal{E}$ is the total energy per unit volume and $Y_k$ is the mass fraction of species $\Sigma_k$
(that is, $\rho Y_k$ is the separate density of species $\Sigma_k$, and $\sum_{k=1}^{N} Y_k = 1$).


For the sake of simplicity, we first assume that each species $\Sigma_k$ obeys the perfect
gas laws, and in particular has constant specific heats at constant volume and pressure
$C_{vk}$ and $C_{pk}$. We will also denote by $\gamma_k$ the ratio $\gamma_k = C_{pk}/C_{vk}$, and by $M_k$ the molecular weight
of species $\Sigma_k$. The total pressure $p$ is then given by Dalton's law as:

$$ p = \sum_{k=1}^{N} \rho Y_k \frac{R}{M_k} T , \qquad (2.2) $$

where $R$ is the universal gas constant and $T$ is the temperature of the mixture (the
same for all species). Considering that the $N$ species may have different specific heats
of formation $h_k^0$, we write the total energy $\mathcal{E}$ as (see e.g. [5], [22], [43]):

$$ \mathcal{E} = \sum_{k=1}^{N} \left( \frac{1}{2} \rho Y_k u^2 + \rho Y_k C_{vk} T + \rho Y_k h_k^0 \right) . \qquad (2.3) $$

Since the temperature does not appear in the conservation relations (2.1), we can
eliminate it in (2.2)-(2.3) and consider that, in (2.1), the pressure $p$ is given by the
following relation, which is deduced from (2.2)-(2.3) and Mayer's relation
$M_k (C_{pk} - C_{vk}) = R$:

$$ p = (\gamma - 1) \left( \mathcal{E} - \frac{1}{2} \rho u^2 - \sum_{k=1}^{N} \rho Y_k h_k^0 \right) . \qquad (2.4) $$

Here, $\gamma$ is the local ratio of the specific heats of the mixture:

$$ \gamma = \frac{(C_p)_{mixture}}{(C_v)_{mixture}} = \frac{\sum_k Y_k C_{pk}}{\sum_k Y_k C_{vk}} = \frac{\sum_k Y_k C_{vk} \gamma_k}{\sum_k Y_k C_{vk}} . \qquad (2.5) $$
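As a quick consistency check of relations (2.2)-(2.5), the sketch below builds the total energy from a temperature, then recovers the pressure from the conserved quantities alone and compares it with Dalton's law. The function names and the three-species data are illustrative assumptions, not values from the paper; Mayer's relation is used in the form $R/M_k = C_{pk} - C_{vk} = (\gamma_k - 1) C_{vk}$.

```python
def mixture_gamma(Y, cv, gam):
    """Mixture specific heat ratio (2.5): sum(Y_k Cv_k gamma_k) / sum(Y_k Cv_k)."""
    num = sum(y * c * g for y, c, g in zip(Y, cv, gam))
    den = sum(y * c for y, c in zip(Y, cv))
    return num / den


def total_energy(rho, u, T, Y, cv, h0):
    """Total energy per unit volume, relation (2.3)."""
    return sum(0.5 * rho * y * u * u + rho * y * c * T + rho * y * h
               for y, c, h in zip(Y, cv, h0))


def pressure(rho, u, energy, Y, cv, gam, h0):
    """Pressure from (2.4); the temperature is not needed."""
    formation = sum(rho * y * h for y, h in zip(Y, h0))
    kinetic = 0.5 * rho * u * u
    return (mixture_gamma(Y, cv, gam) - 1.0) * (energy - kinetic - formation)


def dalton_pressure(rho, T, Y, cv, gam):
    """Dalton's law (2.2) with R / M_k replaced by (gamma_k - 1) * Cv_k."""
    return sum(rho * y * (g - 1.0) * c * T for y, c, g in zip(Y, cv, gam))


# Hypothetical three-species mixture (arbitrary consistent units).
cv = [1.0, 1.5, 0.7]
gam = [1.4, 5.0 / 3.0, 1.3]
h0 = [0.0, 2.0, -1.0]
Y, rho, u, T = [0.2, 0.5, 0.3], 1.2, 0.8, 3.0

E = total_energy(rho, u, T, Y, cv, h0)
p_from_energy = pressure(rho, u, E, Y, cv, gam, h0)
p_from_dalton = dalton_pressure(rho, T, Y, cv, gam)
print(p_from_energy, p_from_dalton)
```

Both routes should agree to rounding error, which is exactly the statement that (2.4) follows from (2.2)-(2.3).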

Remark 1: In fact, several of the numerical methods presented below also apply to
mixtures of real gases. We will not consider this case in detail below, referring to e.g.
[12], [18], and in particular to [23] where a general presentation is given for mixtures
where the relation giving the total pressure has the form:

$$ p = p\left( \mathcal{E} - \frac{1}{2} \rho u^2 , \rho Y_k \right) , \qquad (2.6) $$

which is in particular the case if all components in the mixture satisfy Boyle's or
Mariotte's law (see [23]). •

2.2. The Riemann problem

For the sake of simplicity, we will from now on restrict our attention to the case of
a mixture made of only two species $\Sigma_1$ and $\Sigma_2$; but all results presented below can be
straightforwardly extended to mixtures consisting of any number of components $N$.

Thus, we rewrite (2.1) as:

$$
\begin{pmatrix} \rho u \\ \mathcal{E} \\ \rho Y_1 \\ \rho Y_2 \end{pmatrix}_t
+ \begin{pmatrix} \rho u^2 + p \\ u(\mathcal{E} + p) \\ \rho u Y_1 \\ \rho u Y_2 \end{pmatrix}_x = 0 .
\qquad (2.7)
$$

System (2.7) can be rewritten in a simpler equivalent form. Simply denoting by $Y$ the
mass fraction $Y_1$ of the first species and by $E$ the sum of the kinetic and thermal energy
per unit volume:

$$ E = \mathcal{E} - \sum_k \rho Y_k h_k^0 , \qquad (2.8) $$

we get:

$$
\begin{pmatrix} \rho \\ \rho u \\ E \\ \rho Y \end{pmatrix}_t
+ \begin{pmatrix} \rho u \\ \rho u^2 + p \\ u(E + p) \\ \rho u Y \end{pmatrix}_x = 0 ,
\qquad (2.9)
$$

where the first equation expresses the conservation of mass for the mixture and has the
usual form of the continuity equation for a single fluid, and where:

$$ p = (\gamma - 1) \left( E - \frac{1}{2} \rho u^2 \right) , \qquad (2.10) $$

with:

$$ \gamma = \frac{Y C_{v1} \gamma_1 + (1 - Y) C_{v2} \gamma_2}{Y C_{v1} + (1 - Y) C_{v2}} . \qquad (2.11) $$

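We will use the classical notations $W$ and $F$ for the vectors of the conservative variables
and of the fluxes:

$$
W = \begin{pmatrix} \rho \\ \rho u \\ E \\ \rho Y \end{pmatrix} , \qquad
F(W) = \begin{pmatrix} F^1 \\ F^2 \\ F^3 \\ F^4 \end{pmatrix}
= \begin{pmatrix} \rho u \\ \rho u^2 + p \\ u(E + p) \\ \rho u Y \end{pmatrix} .
\qquad (2.12)
$$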

Then, we have the two following simple results (which are shown in e.g. [1], [2],
[23]):

Proposition 1:

The flux vector $F$ is a homogeneous function of degree 1 of $W$. •

Proposition 2:

If the specific heat ratio $\gamma_k$ of each species in the mixture satisfies the inequality:

$$ \gamma_k > 1 , \qquad (2.13) $$

then the system (2.9) is hyperbolic. •

The proof of these results is straightforward. Proposition 1 simply follows from the
observation that $\gamma = \gamma(W)$ is homogeneous of degree 0. And the assumption (2.13) is
needed in order to ensure that $\gamma(W) > 1$ for any $W$, since the last equality in (2.5) shows
that the local value of $\gamma$ is a convex combination of the $\gamma_k$'s.
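The homogeneity stated in Proposition 1 is easy to observe numerically. The following sketch, which assumes illustrative values for the specific heats and ratios of the two species, evaluates the two-component flux of system (2.9) and compares $F(\lambda W)$ with $\lambda F(W)$.

```python
CV = (1.0, 1.5)              # assumed specific heats at constant volume
GAMMA = (1.4, 5.0 / 3.0)     # assumed specific heat ratios, both > 1


def gamma_of(w):
    """Local gamma (2.11); depends on W only through Y = (rho Y) / rho."""
    y = w[3] / w[0]
    num = y * CV[0] * GAMMA[0] + (1.0 - y) * CV[1] * GAMMA[1]
    den = y * CV[0] + (1.0 - y) * CV[1]
    return num / den


def flux(w):
    """Flux vector F(W) of system (2.9) for W = (rho, rho*u, E, rho*Y)."""
    rho, m, e, z = w
    u = m / rho
    p = (gamma_of(w) - 1.0) * (e - 0.5 * rho * u * u)   # relation (2.10)
    return [m, m * u + p, u * (e + p), u * z]


w = [1.3, 0.4, 2.5, 0.9]     # an arbitrary admissible state
lam = 3.7
scaled = [lam * x for x in w]

# gamma is homogeneous of degree 0, the flux of degree 1.
gamma_gap = abs(gamma_of(scaled) - gamma_of(w))
flux_gap = max(abs(a - lam * b) for a, b in zip(flux(scaled), flux(w)))
print(gamma_gap, flux_gap)
```

Both gaps should vanish up to rounding error.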

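Let us simply make precise here that the eigenvalues of the $4 \times 4$ Jacobian matrix
$A(W) = \frac{DF}{DW}$ are:

$$ \lambda_1 = u - c , \quad \lambda_2 = u , \quad \lambda_3 = u , \quad \lambda_4 = u + c , \qquad (2.14) $$

where the sound speed $c$ has the usual expression:

$$ c = \sqrt{\frac{\gamma p}{\rho}} , \qquad (2.15) $$

but with the local value (2.11) of $\gamma$.


Remark 2: The developed expression of the matrix $A(W)$ will be useful in the sequel.
A straightforward calculation shows that:

$$
A(W) = \begin{pmatrix}
0 & 1 & 0 & 0 \\
\frac{\gamma - 3}{2} u^2 - Y X & (3 - \gamma) u & \gamma - 1 & X \\
-uH + \frac{\gamma - 1}{2} u^3 - u Y X & H - (\gamma - 1) u^2 & \gamma u & u X \\
-uY & Y & 0 & u
\end{pmatrix} ,
\qquad (2.16)
$$

where $H = \frac{E + p}{\rho}$ is the specific enthalpy of the mixture, and where:

$$
X = \frac{p}{(\gamma - 1) \rho} \, \frac{C_{v1} C_{v2} (\gamma_1 - \gamma_2)}{[Y C_{v1} + (1 - Y) C_{v2}]^2}
= \frac{C_{v1} C_{v2} (\gamma_1 - \gamma_2) T}{Y C_{v1} + (1 - Y) C_{v2}} .
\qquad (2.17)
$$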

Remark 3: It is shown in [23] that, for gases obeying Boyle's law, Proposition 1 still
holds; that is, the pressure $p$ in (2.6) is then a homogeneous function of degree 1. •

Remark 4: Proposition 2 can also be extended to real gas mixtures. For a two-component
mixture, setting $\rho' = \rho Y$ and writing (2.6) under the form:

$$ p = p\left( \mathcal{E} - \frac{1}{2} \rho u^2 , \rho , \rho' \right) = P(e, \rho, \rho') , \qquad (2.18) $$

one can show that the system (2.9) is hyperbolic if the quantity

$$ c^2 = P_\rho + Y P_{\rho'} + \frac{e}{\rho} P_e + \frac{p}{\rho} P_e \qquad (2.19) $$

is always positive. The eigenvalues of the Jacobian matrix $A(W)$ are then again given
by (2.14), with the sound speed $c$ given by (2.19). Moreover, a particularly nice
simplification arises when the homogeneity property holds (see Remark 3): then, the sound
speed again has its usual expression

$$ c = \sqrt{\frac{\gamma p}{\rho}} , \qquad (2.20) $$

the definition of $\gamma$ being extended from perfect-gas mixtures to real-gas mixtures by the
following relation (which uses the partial derivative of (2.18)):

$$ \gamma = P_e + 1 . \qquad (2.21) $$

We refer to [23] for the details. •


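Since system (2.9) is hyperbolic, it now makes sense to examine the Riemann
problem for this system. Introducing two states $W_L$ and $W_R$, we consider the problem:

$$
\begin{cases}
W_t + F(W)_x = 0 & \text{for } x \in \mathbb{R} , \ t > 0 , \\
W(x, 0) = \begin{cases} W_L & \text{if } x < 0 , \\ W_R & \text{if } x > 0 . \end{cases}
\end{cases}
\qquad (2.22)
$$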

When trying to solve this problem, a first important question concerns the genuine
nonlinearity or the degeneracy of the characteristic fields (see [24]). As in the
single-component case, the answer is here that the first and fourth characteristic fields are
genuinely nonlinear, whereas the characteristic fields associated with the eigenvalue $u$
are linearly degenerate (see [23]). Exactly as in the case of the single-component Euler
equations, it is then possible, by analysing the shock or rarefaction waves associated
with the nonlinear characteristic fields, to completely solve the Riemann problem (2.22)
for any left and right states $W_L$ and $W_R$. Referring to [1], [2], [23] for the details, we
simply describe here the structure of the solution of (2.22).

This exact solution $W^{\mathcal{R}}$ is of course self-similar (i.e. $W^{\mathcal{R}}(x, t)$ only depends on the
ratio $x/t$), and consists, as in the single-component case, of four constant states $W^{(1)}$,
$W^{(2)}$, $W^{(3)}$, $W^{(4)}$ separated by shocks, rarefaction waves or a contact discontinuity.
More precisely, as shown on Figure 1, $W^{(1)} = W_L$ and $W^{(2)}$ are separated by a wave
associated with the first characteristic field, that is with the eigenvalue $\lambda_1 = u - c$,
either a 1-shock or a 1-rarefaction wave; $W^{(2)}$ and $W^{(3)}$ are separated by a contact
discontinuity (associated with the eigenvalue $u$); $W^{(3)}$ and $W^{(4)} = W_R$ are separated
by a 4-wave, associated with $\lambda_4 = u + c$. Also, the pressure $p$ and the velocity $u$
are continuous across the contact discontinuity. Last but not least, the mass fraction $Y$
remains constant across the 1-wave and the 4-wave (whatever these waves are, shocks or
rarefactions) and only varies across the contact discontinuity. This fact has important
consequences. Indeed, $\gamma$ is constant on each side of the contact discontinuity. Thus,
on the left side of the discontinuity the mixture has the composition of the state $W_L$,
and behaves as a single perfect gas whose specific heat ratio is $\gamma_L = \gamma(W_L)$. Analogous
conclusions hold for the right side of the contact discontinuity.

Figure 1: The solution of the multi-component Riemann problem (2.22).

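For $\sigma \in \mathbb{R}$ and $t > 0$, we will denote by $\mathcal{W}(\sigma; W_L, W_R)$ (or sometimes simply $\mathcal{W}(\sigma)$)
the value of $W^{\mathcal{R}}(\sigma t, t)$, which is independent of $t$. We will need below the following
property of $\mathcal{W}$:

Proposition 3:

For any states $W_L$ and $W_R$, the following equality holds:

$$
F^4[\mathcal{W}(0)] = F^1[\mathcal{W}(0)] \times
\begin{cases} Y_L & \text{if } F^1[\mathcal{W}(0)] > 0 , \\ Y_R & \text{if } F^1[\mathcal{W}(0)] < 0 . \end{cases}
\qquad (2.23)
$$

Proposition 3 is proved in [20], [21]. The proof essentially consists in showing that
$F^1[\mathcal{W}(0)]$ and the speed of the contact discontinuity have the same sign.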

Remark 5: Solving the Riemann problem (2.22) for real-gas mixtures, with the
equation of state (2.18), is more difficult. If some convexity property is assumed for the
pressure law (2.18), one can again show that the first and fourth characteristic fields
are genuinely nonlinear, and the Riemann problem can be solved exactly (see [9], [31]).
Without such an assumption, the genuine nonlinearity may fail, and there exists for
the moment no general procedure to solve (2.22). In all cases, one can show that the
characteristic field associated with the eigenvalue $u$ remains linearly degenerate, and
that, in the exact solution of (2.22), the mass fraction $Y$ remains constant everywhere
except at the contact discontinuity associated with the eigenvalue $u$ (see [23]). •


3.  ONE-DIMENSIONAL  MULTI-COMPONENT  UPWIND  SCHEMES 


Let us now consider the numerical solution of an initial value problem associated
with system (2.9):

$$
\begin{cases}
W_t + F(W)_x = 0 & \text{for } x \in \mathbb{R} , \ t > 0 , \\
W(x, 0) = W^0(x) & \text{for } x \in \mathbb{R} .
\end{cases}
\qquad (3.1)
$$

We will restrict our attention in this section to explicit, three-point, first-order accurate
schemes written in conservative form. In other words, using very classical notations, we
consider numerical schemes of the form:

$$ \frac{W_i^{n+1} - W_i^n}{\Delta t} + \frac{\phi_{i+1/2} - \phi_{i-1/2}}{\Delta x} = 0 , \qquad (3.2) $$

where the numerical flux $\phi_{i+1/2}$ is evaluated using a "numerical flux function" $\Phi$:

$$ \phi_{i+1/2} = \Phi(W_i^n, W_{i+1}^n) \qquad (3.3) $$

(we write $\phi_{i+1/2}$ instead of $\phi_{i+1/2}^n$ for simplicity).

The knowledge of the numerical flux function $\Phi$ defines the scheme under
consideration. Below, we will consider two types of flux functions, based on approximate
Riemann solvers (also known as Godunov-type schemes) and on flux-vector splitting
techniques.
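The structure of the conservative update (3.2)-(3.3) can be sketched independently of the particular flux function. In the illustration below, `conservative_step` and the scalar upwind flux are hypothetical names chosen for this example; the test problem is linear advection at unit speed rather than the Euler system, only to keep the arithmetic transparent.

```python
def conservative_step(w, numerical_flux, dt, dx):
    """One explicit step of (3.2) on the interior cells.

    w is a list of state tuples; numerical_flux(wl, wr) plays the role of the
    flux function of (3.3) and returns a tuple of the same length.
    The two end cells are left unchanged (a simplification of this sketch).
    """
    phi = [numerical_flux(w[i], w[i + 1]) for i in range(len(w) - 1)]
    new = [tuple(s) for s in w]
    for i in range(1, len(w) - 1):
        new[i] = tuple(wk - dt / dx * (fr - fl)
                       for wk, fr, fl in zip(w[i], phi[i], phi[i - 1]))
    return new, phi


def upwind(wl, wr):
    """Upwind flux for the scalar equation w_t + w_x = 0: take the left state."""
    return wl


w0 = [(1.0,), (1.0,), (0.0,), (0.0,), (0.0,)]
w1, phi = conservative_step(w0, upwind, dt=0.5, dx=1.0)

# Conservation: the change of the interior sum equals the net boundary flux.
change = sum(s[0] for s in w1[1:-1]) - sum(s[0] for s in w0[1:-1])
net_flux = 0.5 / 1.0 * (phi[0][0] - phi[-1][0])
print(w1, change, net_flux)
```

Because every interior interface flux is added to one cell and subtracted from its neighbour, the interior total changes only through the two outermost fluxes; this telescoping is what "conservative form" means in (3.2).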


3.1.  Multi-component  approximate  Riemann  solvers 

In  this  section,  we  will  consider  two  well-known  existing  approximate  Riemann 
solvers  which  have  been  developed  for  the  single-component  Euler  equations,  namely 
the  Roe  and  Osher  approximate  Riemann  solvers,  and  present  for  each  of  them  two 
different  generalizations  to  multi-component  problems,  which  we  will  refer  to  as  the 
fully-coupled  and  the  weakly-coupled  generalizations.  We  will  conclude  the  section  by 
discussing the relative merits and drawbacks of both extensions.

3.1.1. The fully-coupled multi-component Roe scheme

The extension of Roe's scheme [29] to the two-component system (3.1) has been
derived in [1] and [10]. The two-component numerical flux function has the form:

$$ \Phi(W_L, W_R) = \frac{F(W_L) + F(W_R)}{2} + \frac{1}{2} |\tilde{A}| (W_L - W_R) , \qquad (3.4) $$

where $\tilde{A} = \tilde{A}(W_L, W_R)$ is a diagonalisable matrix satisfying Roe's property:

$$ F(W_L) - F(W_R) = \tilde{A} (W_L - W_R) , \qquad (3.5) $$

and where the matrix $|\tilde{A}|$ is defined as follows: the diagonalisation of $\tilde{A}$ being written
as $\tilde{A} = T \Lambda T^{-1}$ with $\Lambda = \mathrm{Diag}[\mu_1, \mu_2, \ldots, \mu_n]$, we set $|\tilde{A}| = T |\Lambda| T^{-1}$ with
$|\Lambda| = \mathrm{Diag}[|\mu_1|, |\mu_2|, \ldots, |\mu_n|]$.

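To construct this matrix $\tilde{A}$, one introduces the state $\tilde{W} = (\tilde{\rho}, \tilde{\rho} \tilde{u}, \tilde{E}, \tilde{\rho} \tilde{Y})^T$, known
as "Roe's average of $W_L$ and $W_R$"; this state is defined by the relations:

$$
\tilde{u} = \frac{u_L \sqrt{\rho_L} + u_R \sqrt{\rho_R}}{\sqrt{\rho_L} + \sqrt{\rho_R}} , \qquad
\tilde{H} = \frac{H_L \sqrt{\rho_L} + H_R \sqrt{\rho_R}}{\sqrt{\rho_L} + \sqrt{\rho_R}} ,
\qquad (3.6)
$$

as in the single-component case, and:

$$ \tilde{Y} = \frac{Y_L \sqrt{\rho_L} + Y_R \sqrt{\rho_R}}{\sqrt{\rho_L} + \sqrt{\rho_R}} \qquad (3.7) $$

(determining $\tilde{\rho}$ is not necessary). But in this two-component context, unless both species
in the mixture have the same specific heat ratio $\gamma_1 = \gamma_2$ (that is unless $\gamma = \gamma(W)$ is a
constant), the flux Jacobian matrix $A(\tilde{W})$ evaluated at this Roe-averaged state $\tilde{W}$ does
not satisfy property (3.5). Therefore, the matrix $\tilde{A}$ is to be chosen different from $A(\tilde{W})$
(but close to the latter since we want the extension to reduce to the usual Roe scheme
when both species are the same). The result given in [1], [10] is the following:

$$
\tilde{A} = \begin{pmatrix}
0 & 1 & 0 & 0 \\
\frac{\tilde{\gamma} - 3}{2} \tilde{u}^2 - \tilde{Y} \tilde{X} & (3 - \tilde{\gamma}) \tilde{u} & \tilde{\gamma} - 1 & \tilde{X} \\
-\tilde{u} \tilde{H} + \frac{\tilde{\gamma} - 1}{2} \tilde{u}^3 - \tilde{u} \tilde{Y} \tilde{X} & \tilde{H} - (\tilde{\gamma} - 1) \tilde{u}^2 & \tilde{\gamma} \tilde{u} & \tilde{u} \tilde{X} \\
-\tilde{u} \tilde{Y} & \tilde{Y} & 0 & \tilde{u}
\end{pmatrix} ,
\qquad (3.8)
$$

where $\tilde{\gamma} = \gamma(\tilde{W})$, but where $\tilde{X}$ is not equal to $X(\tilde{W})$ given by (2.17): in order to ensure
that (3.5) holds, one has to choose (we refer to [1] for the detailed algebra):

$$ \tilde{X} = \frac{C_{v1} C_{v2} (\gamma_1 - \gamma_2) \tilde{T}}{\tilde{Y} C_{v1} + (1 - \tilde{Y}) C_{v2}} , \qquad (3.9) $$

with:

$$ \tilde{T} = \frac{T_L \sqrt{\rho_L} + T_R \sqrt{\rho_R}}{\sqrt{\rho_L} + \sqrt{\rho_R}} . \qquad (3.10) $$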
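Roe's property (3.5) for the matrix (3.8) with the choice (3.9)-(3.10) can be verified numerically. The sketch below assumes illustrative species data and two arbitrary states; the temperature is recovered from the thermal energy through $E - \frac{1}{2}\rho u^2 = \rho\,[Y C_{v1} + (1-Y) C_{v2}]\,T$, which follows from (2.3) and (2.8). All function names are ours.

```python
import math

CV = (1.0, 1.5)              # assumed specific heats at constant volume
GAMMA = (1.4, 5.0 / 3.0)     # assumed specific heat ratios


def cv_mix(y):
    return y * CV[0] + (1.0 - y) * CV[1]


def gamma_mix(y):
    return (y * CV[0] * GAMMA[0] + (1.0 - y) * CV[1] * GAMMA[1]) / cv_mix(y)


def primitives(w):
    """Return (rho, u, H, Y, T, p) for W = (rho, rho*u, E, rho*Y)."""
    rho, m, e, z = w
    u, y = m / rho, z / rho
    thermal = e - 0.5 * rho * u * u           # thermal energy per unit volume
    p = (gamma_mix(y) - 1.0) * thermal        # relation (2.10)
    t = thermal / (rho * cv_mix(y))
    return rho, u, (e + p) / rho, y, t, p


def flux(w):
    rho, u, h, y, _t, p = primitives(w)
    return [rho * u, rho * u * u + p, rho * u * h, rho * u * y]


def roe_matrix(wl, wr):
    """Matrix (3.8) built from the averages (3.6), (3.7), (3.9), (3.10)."""
    rl, ul, hl, yl, tl, _ = primitives(wl)
    rr, ur, hr, yr, tr, _ = primitives(wr)
    sl, sr = math.sqrt(rl), math.sqrt(rr)

    def avg(a, b):
        return (a * sl + b * sr) / (sl + sr)

    u, h, y, t = avg(ul, ur), avg(hl, hr), avg(yl, yr), avg(tl, tr)
    g = gamma_mix(y)
    x = CV[0] * CV[1] * (GAMMA[0] - GAMMA[1]) * t / cv_mix(y)
    return [
        [0.0, 1.0, 0.0, 0.0],
        [0.5 * (g - 3.0) * u * u - y * x, (3.0 - g) * u, g - 1.0, x],
        [-u * h + 0.5 * (g - 1.0) * u ** 3 - u * y * x,
         h - (g - 1.0) * u * u, g * u, u * x],
        [-u * y, y, 0.0, u],
    ]


def roe_residual(wl, wr):
    """Largest entry of F(WL) - F(WR) - A~ (WL - WR)."""
    a = roe_matrix(wl, wr)
    fl, fr = flux(wl), flux(wr)
    dw = [l - r for l, r in zip(wl, wr)]
    return max(abs(fl[i] - fr[i] - sum(a[i][j] * dw[j] for j in range(4)))
               for i in range(4))


def state(rho, u, p, y):
    return [rho, rho * u, p / (gamma_mix(y) - 1.0) + 0.5 * rho * u * u, rho * y]


residual = roe_residual(state(1.0, 0.3, 1.0, 0.9), state(0.125, -0.2, 0.1, 0.2))
print(residual)
```

The residual should be at rounding level. Replacing $\tilde{T}$ in (3.9) by the temperature of the averaged state would, in general, break this identity, which is the point of the remark that $\tilde{X} \neq X(\tilde{W})$.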

3.1.2.  The  fully-coupled  multi-component  Osher  scheme 


The extension of Osher's scheme [28] to multi-component flows is straightforward
(exactly in the same way as we have seen in Section 2.2 that the extension of exact
Riemann solvers is straightforward), and has been done by Abgrall and Montagne [2].
The extended scheme is defined by the numerical flux function:

$$ \Phi(W_L, W_R) = \frac{F(W_L) + F(W_R)}{2} - \frac{1}{2} \int_{W_L}^{W_R} |A(W)| \, dW , \qquad (3.11) $$

where the integration is carried out on a path connecting $W_L$ and $W_R$ in the state
space. As in the single-component case, the integration path is piecewise parallel to
the right eigenvectors of the flux Jacobian matrix $A$, and the evaluation of the
integral in (3.11) relies on the knowledge of the Riemann invariants associated with each
eigenvector. These Riemann invariants are given in e.g. [2], [23], and the problem of
evaluating the integral in (3.11) has many similarities with the analogous problem in
the single-component case, since $Y$ (and therefore $\gamma$) is constant along those pieces of
the integration path which are parallel to the first or to the last eigenvector. We refer
the reader to [2] for the details.

3.1.3.  The  weakly-coupled  multi-component  Roe  and  Osher  schemes 


This weakly-coupled extension, proposed in [20], [21], is based on the property
(2.23) of the exact solution of the multi-component Riemann problem (2.22).

In order to introduce it, let us first recall that in the Godunov-type schemes, the
numerical flux $\phi_{i+1/2}$ is seen as an approximation of the flux $F[\mathcal{W}(0; W_i^n, W_{i+1}^n)]$ of
the exact solution of the Riemann problem constructed with the neighbour states $W_i^n$
and $W_{i+1}^n$. One even has the equality $\phi_{i+1/2} = F[\mathcal{W}(0; W_i^n, W_{i+1}^n)]$ in the original
Godunov method [19].

With this idea in mind, we define the weakly-coupled multi-component extension
as follows. The first three components of $\phi_{i+1/2}$ (that is, the density, momentum and
energy discrete fluxes) are evaluated using a classical single-component scheme (Roe or
Osher scheme), with a "frozen $\gamma$". This means that, for the evaluation of the first three
components of $\phi_{i+1/2}$ at time $t^n$, one uses the flux of the usual (i.e. single-component)
scheme with $\gamma = \gamma\left( \frac{Y_i^n + Y_{i+1}^n}{2} \right)$.

Besides this, one evaluates the fourth component of the numerical flux from the
following relation, which mimics (2.23):

$$
\phi^4_{i+1/2} = \phi^1_{i+1/2} \times
\begin{cases} Y_i^n & \text{if } \phi^1_{i+1/2} > 0 , \\ Y_{i+1}^n & \text{if } \phi^1_{i+1/2} < 0 . \end{cases}
\qquad (3.12)
$$

Besides its simplicity, this approach has the major advantage of preserving the
maximum principle for the mass fraction. The result shown in [21] is the following:

Proposition  4: 

Under  the  following  CFL-like  conditions: 

At_  max(t;+l/a,0)  _  min(d,1+l/2,0)1  < 

Pi  P?+ 1  “  ’ 

the weakly-coupled multi-component schemes defined above preserve the maximum
principle for the mass fraction $Y$: for all $i$ and $n \ge 0$:

$$ \min_j Y_j^0 \le Y_i^n \le \max_j Y_j^0 . \qquad (3.14) $$ •

The  proof  of  Proposition  4  essentially  relies  on  the  fact  that  weakly-coupled  schemes 
use  the  same  discrete  mass  fluxes  for  all  discrete  mass  conservation  equations  (the 
continuity  equation  and  the  species  equation). 
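To make the mechanism concrete, here is a minimal sketch of a weakly-coupled scheme on a two-species shock tube. It is not the scheme of [20], [21]: the Euler fluxes are computed with a Rusanov (local Lax-Friedrichs) flux as a simple stand-in for the Roe or Osher flux, the species data are assumed, and the time step uses a plain CFL number of 0.4 rather than the condition of Proposition 4. What the sketch retains from the text is the frozen $\gamma$ at each interface and the upwinded species flux (3.12). With this construction each new $Y_i$ is a convex combination of $Y_{i-1}$, $Y_i$, $Y_{i+1}$ as long as $\frac{\Delta t}{\Delta x}\left[\max(\phi^1_{i+1/2}, 0) - \min(\phi^1_{i-1/2}, 0)\right] \le \rho_i^n$, a sufficient condition derived for this sketch.

```python
import math

GAMMA = (1.4, 5.0 / 3.0)     # assumed species specific heat ratios
CV = (1.0, 1.5)              # assumed specific heats at constant volume


def gamma_mix(y):
    """Mixture specific heat ratio, relation (2.11)."""
    num = y * CV[0] * GAMMA[0] + (1.0 - y) * CV[1] * GAMMA[1]
    return num / (y * CV[0] + (1.0 - y) * CV[1])


def euler_flux(rho, m, e, g):
    """First three flux components for a given (frozen) gamma, plus |u| + c."""
    u = m / rho
    p = (g - 1.0) * (e - 0.5 * rho * u * u)
    c = math.sqrt(g * p / rho)
    return (m, m * u + p, u * (e + p)), abs(u) + c


def weakly_coupled_flux(wl, wr):
    """Numerical flux between states (rho, rho*u, E, rho*Y)."""
    yl, yr = wl[3] / wl[0], wr[3] / wr[0]
    g = gamma_mix(0.5 * (yl + yr))               # frozen gamma at the interface
    fl, al = euler_flux(wl[0], wl[1], wl[2], g)
    fr, ar = euler_flux(wr[0], wr[1], wr[2], g)
    a = max(al, ar)
    # Rusanov flux: a stand-in for the Roe or Osher flux of the text.
    phi = [0.5 * (fl[k] + fr[k]) - 0.5 * a * (wr[k] - wl[k]) for k in range(3)]
    # Species flux: mass flux times the upwind mass fraction, as in (3.12).
    phi.append(phi[0] * (yl if phi[0] > 0.0 else yr))
    return phi, a


def step(w, dx, cfl=0.4):
    fluxes = [weakly_coupled_flux(w[i], w[i + 1]) for i in range(len(w) - 1)]
    dt = cfl * dx / max(a for _, a in fluxes)
    new = [list(s) for s in w]
    for i in range(1, len(w) - 1):               # end cells are kept fixed
        for k in range(4):
            new[i][k] = w[i][k] - dt / dx * (fluxes[i][0][k] - fluxes[i - 1][0][k])
    return new, dt


def shock_tube(n=100):
    """Species 1 at high pressure on the left, species 2 on the right."""
    w = []
    for i in range(n):
        rho, p, y = (1.0, 1.0, 1.0) if i < n // 2 else (0.125, 0.1, 0.0)
        w.append([rho, 0.0, p / (gamma_mix(y) - 1.0), rho * y])
    return w


w = shock_tube()
for _ in range(50):
    w, _dt = step(w, dx=1.0 / 100)
ys = [s[3] / s[0] for s in w]
print(min(ys), max(ys))
```

The printed extrema should stay within $[0, 1]$ up to rounding error. The key design point is that the species flux reuses the very same mass flux $\phi^1$ as the continuity equation; a species flux computed independently of $\phi^1$ would not give this convex-combination structure.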

Remark 6: It is even shown in [21] that the one-dimensional weakly-coupled
multi-component Roe or Osher schemes preserve the mass fraction monotonicity. It is also
observed there that condition (3.13) is not more restrictive than the usual CFL condition
when an upwind scheme is used to evaluate the first three components of $\phi_{i+1/2}$. •

3.1.4.  Fully-coupled  versus  weakly-coupled  multi-component  schemes 


Several detailed comparisons between the fully-coupled and the weakly-coupled
approaches have been carried out, for both Roe and Osher schemes, in one or two space
dimensions (see [6], [7], [21]). As one could expect, these comparisons have shown that:

* The weakly-coupled approach gives more accurate values for the computed
mass fractions $Y$. In particular, the fully-coupled approach does not always
preserve the inequalities $0 \le Y \le 1$: computed mass fraction values between
$-5 \cdot 10^{-2}$ and 0, or between 1 and 1.05, are sometimes obtained with the
fully-coupled schemes.

* The weakly-coupled schemes are easier to implement and cheaper than the
fully-coupled schemes. This is particularly true for the Roe scheme when the number
$N$ of gaseous species in the mixture is large, since the fully-coupled Roe scheme
then involves $(N + 2) \times (N + 2)$ matrices.

* When $\gamma$ does not actually depend on the mixture composition (that is,
if both species have the same specific heat ratio $\gamma_1 = \gamma_2$), the fully-coupled
and the weakly-coupled schemes give exactly the same results for the computed
hydrodynamical variables $\rho$, $u$, $p$. Conversely, when $\gamma$ genuinely depends on
$Y$, the computed values of the hydrodynamical variables $\rho$, $u$, $p$ are slightly less
accurate with the weakly-coupled Roe scheme than with the fully-coupled Roe
scheme. This could be expected since the former approach uses a frozen $\gamma$ for
evaluating the "Euler fluxes" $\phi^1$, $\phi^2$, $\phi^3$, whereas the latter takes the variations
of $\gamma$ into account (the partial derivatives $\frac{\partial \gamma}{\partial Y}$ appear in the matrix $\tilde{A}$ (3.8)).
However, this drawback of the weakly-coupled approach vanishes with a slight
modification of the scheme (aiming at taking the variations of $\gamma$ into account in
the evaluation of the Euler fluxes instead of using a frozen $\gamma$; we refer to [6] for
the details).


Therefore, the weakly-coupled method, which has both advantages of being cheaper
and of preserving the maximum principle for the mass fraction, should be preferred to
the fully-coupled approach. This is in particular the case without any modification of
the computation of the Euler fluxes in those cases where the specific heat ratio $\gamma$ is
constant (which, as said above, happens if $\gamma_1 = \gamma_2$, but also if the variable $Y$, instead of
being the mass fraction of a species in the mixture, represents any other quantity which
is simply convected by the flow).

3.2.  Multi-component  flux  vector  splitting 

The best-known flux vector splitting techniques developed for the single-component
Euler equations, namely the methods of Steger and Warming [34] and of Van Leer
[41], have been extended to multi-component mixtures in [23]. Since the extension of
the Steger and Warming splitting is straightforward (because the multi-component flux
vector is still a homogeneous function of the conservative variables, as in the
single-component case), we will simply consider here the differentiable flux vector splitting of
Van Leer.

As in the single-component case, the numerical flux function of the multi-component
Van Leer scheme has the form:

$$ \Phi(W_L, W_R) = F^+(W_L) + F^-(W_R) , \qquad (3.15) $$

where, for any $W$:

$$ F(W) = F^+(W) + F^-(W) , \qquad (3.16) $$

and where $F^+$ is defined by the following expressions:

* if $u \ge c = \sqrt{\gamma p / \rho}$, $F^+(W) = F(W)$;

* if $-c < u < c$,

$$
F^+(W) = \begin{pmatrix}
F^1_+ \\
F^1_+ \, \frac{(\gamma - 1) u + 2c}{\gamma} \\
F^1_+ \, \frac{[(\gamma - 1) u + 2c]^2}{2 (\gamma^2 - 1)} \\
F^1_+ \, Y
\end{pmatrix} ,
\qquad F^1_+ = \frac{\rho (u + c)^2}{4c}
\qquad (3.17)
$$

(the first three components of $F^+$ have the same expression as in the
single-component case, but now with the non-constant coefficient $\gamma$);

* if $u \le -c$, $F^+(W) = 0$.
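The three cases above translate directly into code. The sketch below takes the local $\gamma$ as an input (it would come from (2.11) in a two-species computation), uses the classical single-component Van Leer expressions for the first three components as the text indicates, and obtains $F^-$ from the decomposition $F = F^+ + F^-$. The function names and the sample states are assumptions of this illustration.

```python
import math


def flux(rho, u, p, y, g):
    """Full flux F(W) for primitive data and local specific heat ratio g."""
    e = p / (g - 1.0) + 0.5 * rho * u * u
    return [rho * u, rho * u * u + p, u * (e + p), rho * u * y]


def van_leer_plus(rho, u, p, y, g):
    """Positive part F+ of the multi-component Van Leer splitting."""
    c = math.sqrt(g * p / rho)
    if u >= c:
        return flux(rho, u, p, y, g)
    if u <= -c:
        return [0.0, 0.0, 0.0, 0.0]
    f1 = rho * (u + c) ** 2 / (4.0 * c)          # split mass flux
    q = (g - 1.0) * u + 2.0 * c
    # The species flux reuses the split mass flux f1.
    return [f1, f1 * q / g, f1 * q * q / (2.0 * (g * g - 1.0)), f1 * y]


def van_leer_minus(rho, u, p, y, g):
    """Negative part, defined through F = F+ + F-."""
    full = flux(rho, u, p, y, g)
    plus = van_leer_plus(rho, u, p, y, g)
    return [a - b for a, b in zip(full, plus)]


def van_leer_flux(left, right):
    """Numerical flux F+(WL) + F-(WR); each state is (rho, u, p, Y, gamma)."""
    return [a + b for a, b in zip(van_leer_plus(*left), van_leer_minus(*right))]


sub = (1.0, 0.3, 1.0, 0.4, 1.5)                  # a subsonic sample state
fp, fm = van_leer_plus(*sub), van_leer_minus(*sub)
print(fp, fm)
```

For a subsonic state the first component of `fm` should coincide with the mirror expression $-\rho (u - c)^2 / (4c)$, and the fourth components are $Y$ times the first ones in both parts.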

As  in  the  single-component  case,  we  have  here  the  property  that  if  all  characteristic 
wave  speeds  associated  with  the  state  W  are  positive  (resp.  negative),  then  F+(W)  = 
F(W)  (resp.  F-(W)  =  F(W)).  Moreover,  the  following  result,  which  says  that  this 
flux  splitting  can  be  used  to  define  a  stable  conservative  scheme,  is  proved  in  [23]  (the 
analogous  result  for  the  single-component  case  was  proved  in  [41]): 

Proposition  5: 

If the specific heat ratio $\gamma_k$ of each species in the mixture satisfies the inequality
$1 < \gamma_k < 3$, then all eigenvalues of the Jacobian matrix $\dfrac{DF^+}{DW}$ (resp. $\dfrac{DF^-}{DW}$) are real and
positive (resp. negative). •
Lastly, we can notice that, exactly as the weakly-coupled Roe and Osher schemes
of the previous section, this multi-component Van Leer scheme uses for the discrete
species equations the same discrete mass fluxes (the first components of $F^\pm$) as for
the continuity equation. This allows us to prove that this scheme also preserves the
maximum principle for the mass fraction (see [21] for the details):

Proposition  6: 

Under  the  following  CFL-like  conditions: 


**  i<! 

Az  2  Cj  ~  ’ 


(3.18) 


the  multi-component  Van  Leer  scheme  defined  above  preserves  the  maximum  principle 
for the mass fraction $Y$: for all $i$ and $n > 0$:

$$\min_j Y_j^0 \le Y_i^n \le \max_j Y_j^0 . \qquad (3.19)$$

•


4.  EXTENSIONS 


In this section, we briefly give an idea of how the above one-dimensional
multi-component upwind schemes can be extended to second-order accuracy, to implicit
time-stepping, and to multi-dimensional reactive flows. We refer to the bibliography for the
details.

4.1.  One-dimensional  extensions 

4.1.1.  Second-order  accuracy 

Starting  from  the  previous  first-order  accurate  schemes,  the  second-order  spatial 
accuracy  is  obtained  by  using  piecewise  linear  variables  instead  of  piecewise  constant 
variables,  following  the  MUSCL  approach  of  Van  Leer  [39],  [40].  To  obtain  schemes 
which  are  second-order  accurate  in  space  but  remain  first-order  accurate  in  time,  the 
method  involves  three  steps: 

(a) At each time step, starting from the values $W_i^n$, one first evaluates slopes
$s_i^n$ for all variables which are chosen to be piecewise linear. Several choices are
possible at this stage (for instance, one can choose either the conservative variables
$\rho$, $\rho u$, $E$, $\rho Y$ or the “physical variables” $\rho$, $u$, $p$, $Y$ to vary linearly in each
computational cell; see [11]); here, we take $Y$ (and not $\rho Y$) as a piecewise linear
variable.


(b) Slope limiters are then applied; here again different strategies exist to
evaluate the limited slopes (see e.g. [11], [30], [32], [38], [39]). Essentially, the
slopes are limited in order to avoid the creation of new extrema, i.e. they are
constrained such that, taking the variable $Y$ as an example, we have:

$$\min_j Y_j^n \le Y_i^n \pm \frac{\Delta x}{2} s_i^n \le \max_j Y_j^n . \qquad (4.1)$$


(c) The limited slopes are then used to evaluate cell-interface values $W_{i+1/2,\pm}^n$
(one sets $Y_{i+1/2,-}^n = Y_i^n + \frac{\Delta x}{2} s_i^n$, $Y_{i-1/2,+}^n = Y_i^n - \frac{\Delta x}{2} s_i^n$), and the solution is
advanced in time according to relation (1.5), where we now take:

$$\phi_{i+1/2}^n = \Phi(W_{i+1/2,-}^n, W_{i+1/2,+}^n) . \qquad (4.2)$$


This construction (a)-(c) can be applied to any of the schemes presented in the
preceding sections, by using in (4.2) the corresponding numerical flux function $\Phi$. In
particular, it is shown in [21] that the result of Proposition 4 still holds: the
weakly-coupled second-order accurate schemes preserve the maximum principle for the mass
fractions, provided that limited slopes are used for the mass fraction $Y$ itself and not
for $\rho Y$.
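As an illustration of steps (a)-(c) for the variable $Y$, here is a minimal sketch using the minmod limiter. The choice of minmod is an assumption made for the example (the text only says that several limiters exist), and the function names are hypothetical.

```python
import numpy as np

def minmod(a, b):
    """Return 0 where a and b have opposite signs, else the one of smaller magnitude."""
    return np.where(a * b > 0.0,
                    np.sign(a) * np.minimum(np.abs(a), np.abs(b)),
                    0.0)

def muscl_interface_values(Y, dx):
    """Limited piecewise-linear reconstruction of Y on the interior cells.

    Returns (Y_iphalf_minus, Y_imhalf_plus): the values extrapolated from each
    interior cell i to its right face (i+1/2,-) and to its left face (i-1/2,+).
    """
    backward = (Y[1:-1] - Y[:-2]) / dx   # one-sided slopes
    forward = (Y[2:] - Y[1:-1]) / dx
    s = minmod(backward, forward)        # limited slope in each interior cell
    Y_iphalf_minus = Y[1:-1] + 0.5 * dx * s
    Y_imhalf_plus = Y[1:-1] - 0.5 * dx * s
    return Y_iphalf_minus, Y_imhalf_plus
```

With this limiter the extrapolated values stay between the neighbouring cell averages, so constraint (4.1) holds with the minimum and maximum taken over cells $i-1$, $i$, $i+1$.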

The  extension  to  a  second-order  time-accurate  scheme  can  also  be  done  in  the  usual 
way  using  a  predictor-corrector  formulation  (see  e.g.  [11],  [15],  [21]). 


4.1.2.  Implicit  time-stepping 


All  the  above  described  explicit  schemes  can  be  made  implicit,  following  the  lines 
of  [10],  [17],  [36]  (these  works  deal  with  the  single-component  case).  To  give  the  idea, 
one  uses  instead  of  the  explicit  formulation  (3.2)  the  following  (linearized  implicit) 
conservative  formulation: 


$$\frac{W_i^{n+1} - W_i^n}{\Delta t} + \frac{\phi_{i+1/2}^{n+1} - \phi_{i-1/2}^{n+1}}{\Delta x} = 0 , \qquad (4.3)$$


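where the numerical flux $\phi_{i+1/2}^{n+1}$ has the form:

$$\phi_{i+1/2}^{n+1} = \Phi(W_i^n, W_{i+1}^n) + \frac{\partial \Phi}{\partial W_L}(W_i^n, W_{i+1}^n)\,(W_i^{n+1} - W_i^n) + \frac{\partial \Phi}{\partial W_R}(W_i^n, W_{i+1}^n)\,(W_{i+1}^{n+1} - W_{i+1}^n) , \qquad (4.4)$$

the quantities $\dfrac{\partial \Phi}{\partial W_L}$ and $\dfrac{\partial \Phi}{\partial W_R}$ being the Jacobian matrices (or approximate Jacobian
matrices) of the numerical flux function (see [10], [17], [36]).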

Since  no  particular  difficulty  appears  in  extending  these  implicit  formulations  from 
the  single-component  case  to  the  multi-component  case  (we  refer  to  [13]  for  the  implicit 
fully-coupled  multi-component  Roe  scheme,  and  to  [18]  for  the  implicit  multi-component 
Van  Leer  scheme),  we  simply  add  now  some  comments  on  the  implicit  weakly-coupled 
multi-component  schemes.  For  these  schemes,  (3.12)  simply  becomes: 


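$$\phi_{i+1/2}^{Y,n+1} = \phi_{i+1/2}^{\rho,n+1} \times \begin{cases} Y_i^{n+1} & \text{if } \phi_{i+1/2}^{\rho,n+1} > 0 , \\ Y_{i+1}^{n+1} & \text{if } \phi_{i+1/2}^{\rho,n+1} < 0 . \end{cases} \qquad (4.5)$$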


Therefore the solution of a linear system is required at each time step to evaluate the
mass fractions $Y_i^{n+1}$; it is shown in [21] that the matrix of this linear system is an
M-matrix (see e.g. [42]), and that the properties of the explicit weakly-coupled schemes
still hold here, without any restriction on the time step:

Proposition  7: 

For any value of the time step $\Delta t$, the implicit weakly-coupled schemes defined by
(4.5) preserve the maximum principle for the mass fraction $Y$: for all $i$ and $n > 0$:

$$\min_j Y_j^0 \le Y_i^n \le \max_j Y_j^0 . \qquad (4.6)$$

•
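The structure behind Proposition 7 can be checked numerically with a small sketch. The assumptions here are ours, not the paper's: the interface mass fluxes are taken as given data (in the actual scheme they come from the implicit hydrodynamic step), the domain is closed (zero mass flux at both ends), and the function name is hypothetical. The species update then reads $A\,Y^{n+1} = \rho^n Y^n$, where $A$ has non-positive off-diagonal entries and row sums equal to $\rho_i^n > 0$.

```python
import numpy as np

def implicit_species_step(rho_n, Y_n, mass_flux, lam):
    """One implicit upwind step for the mass fraction Y.

    mass_flux[i] is the mass flux at the interface between cells i-1 and i
    (length N+1, with zero values at both ends); lam = dt/dx.
    Returns (Y_new, A) where A is the matrix of the linear system.
    """
    N = len(Y_n)
    assert mass_flux[0] == 0.0 and mass_flux[-1] == 0.0  # closed domain assumed
    fp = np.maximum(mass_flux, 0.0)   # part of the flux going to the right
    fm = np.minimum(mass_flux, 0.0)   # part of the flux going to the left
    rho_new = rho_n - lam * (mass_flux[1:] - mass_flux[:-1])
    A = np.zeros((N, N))
    for i in range(N):
        A[i, i] = rho_new[i] + lam * (fp[i + 1] - fm[i])
        if i + 1 < N:
            A[i, i + 1] = lam * fm[i + 1]    # upwind value Y_{i+1} when flux < 0
        if i > 0:
            A[i, i - 1] = -lam * fp[i]       # upwind value Y_{i-1} when flux > 0
    return np.linalg.solve(A, rho_n * Y_n), A
```

Since each row of $A$ sums to $\rho_i^n$ and $A^{-1}$ is non-negative, every $Y_i^{n+1}$ is a convex combination of the $Y_j^n$, whatever the value of `lam`.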


4.2.  Extension  to  two-dimensional  reactive  flows 

Let us now turn to multi-dimensional flows, with the description of diffusive and
reactive effects. To present this extension, we will consider the explicit simulation of a
two-dimensional laminar inviscid flow of a mixture of two reactive species $\Sigma_1$ and $\Sigma_2$
(the extension to three-dimensional flows and to viscous flows can be done along the
same lines; see [16], [32], [36], [37]). Thus, we consider the following system of equations:

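$$\left\{ \begin{aligned}
&\rho_t + (\rho u)_x + (\rho v)_y = 0 , \\
&(\rho u)_t + (\rho u^2 + p)_x + (\rho u v)_y = 0 , \\
&(\rho v)_t + (\rho u v)_x + (\rho v^2 + p)_y = 0 , \\
&E_t + [u(E+p)]_x + [v(E+p)]_y = \nabla\cdot(\lambda \nabla T) + \Omega_T + \sum_{k=1}^{2} \nabla\cdot(\rho D C_{pk} T \nabla Y_k) , \\
&(\rho Y)_t + (\rho u Y)_x + (\rho v Y)_y = \nabla\cdot(\rho D \nabla Y) + \Omega_Y ,
\end{aligned} \right. \qquad (4.7)$$

with:

$$E = \sum_{k=1}^{2} \rho Y_k C_{vk} T + \frac{1}{2}\rho (u^2 + v^2) , \qquad p = \sum_{k=1}^{2} \frac{\rho Y_k R T}{M_k} . \qquad (4.8)$$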


Our notations are classical: $u$ and $v$ are the components of the mixture velocity,
$E$ is again the sum of the internal and kinetic energies per unit volume, $\lambda$ is the mixture
thermal conductivity, $D$ is the molecular diffusion coefficient of species $\Sigma_1$. Lastly, the
source terms $\Omega_T$ and $\Omega_Y$ represent the contribution of the chemical reactions to the
energy and mass fraction equations. We will assume below that the quantities $\lambda$, $\rho D$,
$C_{pk}$ and $C_{vk}$ are constant.

As said in the introduction, we consider mixed finite-element / finite-volume
two-dimensional extensions of the upwind schemes presented above, in the spirit of the
methods developed by A. Dervieux and L. Fezoui [10], [15] for a single gas (but of
course all above one-dimensional schemes can also be extended to structured
finite-volume multi-dimensional meshes). To make it precise, let us first rewrite system
(4.7)-(4.8) under the following form, separating the time-dependent, convective,
diffusive and reactive terms:


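$$W_t + F(W)_x + G(W)_y = P(W, W_x)_x + Q(W, W_y)_y + S(W) , \qquad (4.9)$$

where:

$$\begin{cases}
W = (\rho, \rho u, \rho v, E, \rho Y)^T , \\
F(W) = (\rho u, \rho u^2 + p, \rho u v, u(E+p), \rho u Y)^T , \\
G(W) = (\rho v, \rho u v, \rho v^2 + p, v(E+p), \rho v Y)^T ,
\end{cases} \qquad (4.10)$$

$$\begin{cases}
P(W, W_x) = \Big(0, 0, 0, \lambda T_x + \sum_{k=1}^{2} \rho D C_{pk} T (Y_k)_x , \rho D Y_x\Big)^T , \\
Q(W, W_y) = \Big(0, 0, 0, \lambda T_y + \sum_{k=1}^{2} \rho D C_{pk} T (Y_k)_y , \rho D Y_y\Big)^T ,
\end{cases} \qquad (4.11)$$

$$S(W) = (0, 0, 0, \Omega_T, \Omega_Y)^T . \qquad (4.12)$$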

Then we introduce a (possibly unstructured) finite-element triangulation of the
computational domain. In order to derive a finite-volume formulation, we consider a
dual partition of the domain in control volumes or cells: a cell $C_i$ is constructed around
each vertex $S_i$ by means of the medians of the neighbouring triangles, as shown in
Figure 2.

Figure 2: The control volume $C_i$.

Integrating (4.9) on the control volume $C_i$, we get:

$$\iint_{C_i} W_t + \int_{\partial C_i} (F \nu^x + G \nu^y) = \int_{\partial C_i} (P \nu^x + Q \nu^y) + \iint_{C_i} S , \qquad (4.13)$$

where $\nu = (\nu^x, \nu^y)$ is the outward unit normal on $\partial C_i$. It now remains to specify how
the four integrals in (4.13) are evaluated.

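The time derivative and source term integrals are approximated using a
mass-lumped approximation:

$$\iint_{C_i} W_t = \frac{W_i^{n+1} - W_i^n}{\Delta t} \, \mathrm{area}(C_i) , \qquad \iint_{C_i} S = S(W_i^n) \, \mathrm{area}(C_i) . \qquad (4.14)$$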


In  addition  to  its  simplicity,  the  mass-lumped  approximation  has  two  advantages:  first, 
it  allows  us  to  employ  an  explicit  time  integration  scheme,  which  is  no  longer  possible 
when  the  consistent  non  diagonal  finite-element  mass  matrix  is  used;  moreover,  the 
mass-lumped  approximation  of  the  heat  equation  may  preserve  the  positivity  of  the 
unknowns,  while  a  consistent  finite-element  formulation  does  not  (see  e.g.  [8]). 


Next, we have to consider the integral of the diffusive fluxes in (4.13). In view of
the definitions (4.11) of $P$ and $Q$, this integral reduces to expressions like:


and

(4.15)


To evaluate these terms, we consider here that the integrands are constant in each
triangle $\tau$ of the triangulation. More precisely, we consider that, in a triangle $\tau$ with
vertices $S_j$ ($1 \le j \le 3$), we have:


(4.16)


where $\varphi_j$ is the P1 finite-element basis function associated to vertex $S_j$ and, for the last
term in (4.15):


(4.17)


Then the diffusive term in (4.13) takes the value:


(4.18)


where $P_\tau$ and $Q_\tau$ are the constant values of $P$ and $Q$ in the triangle $\tau$. It is easy to
check that (4.16)-(4.18) is equivalent to a classical P1 finite-element discretization of
the diffusive terms.
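To make the "constant in each triangle" idea concrete, the following sketch computes the (constant) gradients of the three P1 basis functions of a triangle, and from them the gradient of any P1 field given by its vertex values. It only illustrates the standard P1 construction; the function names are hypothetical and the formulas (4.16)-(4.18) themselves are not reproduced here.

```python
import numpy as np

def p1_gradients(xy):
    """Gradients of the three P1 basis functions on a triangle.

    xy: (3, 2) array of vertex coordinates.
    Returns (grads, area); grads[j] is the constant gradient of the basis
    function equal to 1 at vertex j and 0 at the two other vertices.
    """
    (x1, y1), (x2, y2), (x3, y3) = xy
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)  # twice the signed area
    grads = np.array([[y2 - y3, x3 - x2],
                      [y3 - y1, x1 - x3],
                      [y1 - y2, x2 - x1]]) / det
    return grads, 0.5 * abs(det)

def p1_gradient_of(values, xy):
    """Constant gradient in the triangle of the P1 field with given vertex values."""
    grads, _ = p1_gradients(xy)
    return values @ grads
```

For instance, applying `p1_gradient_of` to the vertex temperatures of a triangle gives the piecewise-constant $\nabla T$ that enters the diffusive fluxes.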


Lastly, we come to the approximation of the second integral in (4.13), which is
based of course on the one-dimensional multi-component upwind schemes presented in
the previous sections. Indeed, the system $W_t + F(W)_x + G(W)_y = 0$ is a nonlinear
hyperbolic system of conservation laws, which means that, for any $(\alpha, \beta) \in \mathbb{R}^2$, the
matrix $\alpha \dfrac{\partial F}{\partial W} + \beta \dfrac{\partial G}{\partial W}$ has five real eigenvalues:

$$\begin{cases}
\lambda_1 = \alpha u + \beta v - \sqrt{\alpha^2 + \beta^2}\, c , \\
\lambda_2 = \lambda_3 = \lambda_4 = \alpha u + \beta v , \\
\lambda_5 = \alpha u + \beta v + \sqrt{\alpha^2 + \beta^2}\, c ,
\end{cases} \qquad (4.19)$$

with $c = \sqrt{\gamma p / \rho}$, and a complete set of real eigenvectors. Thus, we can extend all
approximations defined in Section 3 for the one-dimensional flux vector $F$ to the flux
vector $\alpha F + \beta G$ (in other words, we use here the rotational invariance of the
multi-component Euler equations). For instance, given two values $W_L$ and $W_R$ of $W$, and
a vector $\eta = (\eta_x, \eta_y)$, we define a fully-coupled two-dimensional Roe numerical flux
function $\Phi$ by:

$$\Phi(W_L, W_R, \eta) = \frac{1}{2}\left[ F_\eta(W_L) + F_\eta(W_R) \right] + \frac{1}{2} |A_\eta| (W_L - W_R) . \qquad (4.20)$$
In this expression, we have set $F_\eta(W) = \eta_x F(W) + \eta_y G(W)$, and $A_\eta = A_\eta(W_L, W_R)$
is a diagonalisable matrix satisfying Roe's property:

$$F_\eta(W_L) - F_\eta(W_R) = A_\eta (W_L - W_R) , \qquad (4.21)$$

(the matrix $A_\eta$ is deduced from $A$ in (3.8)).

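To evaluate the second integral in (4.13), we first write it in the form:

$$\int_{\partial C_i} (F \nu^x + G \nu^y) = \sum_{j \in K(i)} \int_{\partial C_{ij}} (F \nu^x + G \nu^y) , \qquad (4.22)$$

where $K(i)$ is the set of neighbouring nodes of $S_i$, and where $\partial C_{ij} = \partial C_i \cap \partial C_j$. Then,
defining the vector $\nu_{ij} = (\nu_{ij}^x, \nu_{ij}^y)$ by:

$$\nu_{ij}^x = \int_{\partial C_{ij}} \nu^x , \qquad \nu_{ij}^y = \int_{\partial C_{ij}} \nu^y , \qquad (4.23)$$

we obtain a first-order accurate upwind approximation of the convective fluxes (4.22)
by:

$$\int_{\partial C_i} (F \nu^x + G \nu^y) = \sum_{j \in K(i)} \Phi(W_i, W_j, \nu_{ij}) , \qquad (4.24)$$

with $\Phi$ defined in (4.20).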

This extension to two space dimensions can of course be used also for the
weakly-coupled schemes. Then, the first four components of $\Phi(W_i, W_j, \nu_{ij})$ are evaluated using
a single-component scheme (with $\gamma$ frozen if need be), and the fifth component is given
by:

$$\Phi^Y(W_i, W_j, \nu_{ij}) = \Phi^\rho(W_i, W_j, \nu_{ij}) \times \begin{cases} Y_i & \text{if } \Phi^\rho(W_i, W_j, \nu_{ij}) > 0 , \\ Y_j & \text{if } \Phi^\rho(W_i, W_j, \nu_{ij}) < 0 , \end{cases} \qquad (4.25)$$

where $\Phi^\rho$ denotes the first (mass flux) component of $\Phi$.

It is straightforward to check that this two-dimensional weakly-coupled method
still preserves the maximum principle for the mass fraction. Also, transforming a
multi-dimensional Euler code into a multi-dimensional multi-component code using (4.25) is
very easy and cheap.
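The following sketch shows how little code relation (4.25) requires once a single-component solver already provides the mass flux through each cell interface. The names and the data layout (a list of neighbour fluxes per cell, already integrated over the interface) are assumptions made for the illustration, and only the convective part of the update is shown.

```python
def species_flux(mass_flux, Y_i, Y_j):
    """Species flux from cell i to cell j: the mass flux times the upwind mass fraction."""
    return mass_flux * (Y_i if mass_flux > 0.0 else Y_j)

def update_mass_fraction(rho_i, Y_i, area, dt, neighbours):
    """Explicit convective update of Y in cell i.

    neighbours: list of (mass_flux_ij, Y_j), with mass_flux_ij > 0 when mass
    leaves cell i through the interface shared with cell j.
    """
    rho_new = rho_i - dt / area * sum(f for f, _ in neighbours)
    rhoY_new = rho_i * Y_i - dt / area * sum(
        species_flux(f, Y_i, Y_j) for f, Y_j in neighbours)
    return rhoY_new / rho_new
```

The new value is a convex combination of $Y_i$ and of the $Y_j$ of the inflow neighbours as long as the outgoing mass during one time step does not exceed the mass in the cell, which is the CFL-like restriction behind the maximum principle.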

We do not present in detail here how the limited second-order extension of these
first-order numerical fluxes is derived. The main task in this derivation is to choose
the slope limiters, a problem which is obviously less simple in the present context of
two-dimensional unstructured grids than in one space dimension. We refer to e.g. [10],
[11], [15], [32] for more details.


5.  NUMERICAL  EXAMPLES 


We  now  briefly  present  three  different  numerical  illustrations  of  the  above  methods, 
all  dealing  with  two-dimensional  reactive  flows. 

5.1.  Flame  propagation  in  a  closed  vessel 

To  show  the  ability  of  these  methods  to  operate  on  unstructured  triangulations,  we 
first  consider  two  examples  of  flame  propagation  problems  where  adaptive  highly  non 
uniform  grids  are  used. 

The first example is taken from [26], where a dynamic mesh refinement procedure
is  used  to  follow  the  propagating  flame.  Referring  to  [4],  [25],  [26]  for  the  details,  we 
simply  mention  here  the  basic  features  of  this  adaptive  procedure.  It  uses  a  multi-level 
triangular  finite-element  mesh  with  a  filiation  hierarchy  between  two  consecutive  levels. 
This  procedure  dynamically  refines  and  unrefines  the  mesh,  using  refinement  decisions 
based  on  some  refinement  criterion.  When  the  adaption  routine  is  called,  it  starts  from 
an  original  coarse  mesh,  makes  refinement  decisions  at  that  level  (level  0),  and  creates 
a  new  mesh  (of  level  1)  by  local  element  division;  then  new  decisions  are  made  on  this 
new  mesh,  and  the  mesh  of  level  2  is  created,  and  so  on... 

We  consider  an  experiment  where  a  flame  is  ignited  at  the  middle  of  the  top  wall 
of  a  square  chamber  filled  with  a  combustible  mixture  initially  at  rest,  and  propagates 
downwards  in  the  chamber.  Figure  3  shows  the  five-level  computational  mesh  and  the 
temperature  contours  at  two  early  stages  of  the  flame  propagation  (we  refer  to  [26]  for 
more  details  on  this  computation). 


Figure  3:  (from  [26])  Unsteady  flame  propagation  on  a  five-level  dynamically 
adapted  mesh:  triangulation  and  isotherms  at  two  different  time  levels. 
The numerical method behaves well, although the original unrefined mesh is very
coarse, which makes the adapted mesh highly non-uniform. One remaining difficulty,
which we are currently investigating, is that, although the (second-order accurate)
weakly-coupled multi-component Roe scheme of Section 3.1.3 is used for the
approximation of the convective terms, the computed mass fractions of all species in the
mixture are not always in the interval [0, 1]. This is due to the (unavoidable) presence
of obtuse angles in the adapted triangulations, and of diffusive terms in the governing
equations: it is indeed well known that the classical triangular finite-element
approximation of the heat equation is not positivity-preserving if obtuse angles exist in
the triangulation [8]. We are therefore in the strange situation where we are able to
preserve the mass fraction positivity for the convective terms, but not for the
dissipative terms...

The next numerical example, taken from [27], again concerns a flame propagating
in a closed vessel. Here, the physical parameters have been chosen in order to reproduce
numerically the so-called tulip flame instability which has been investigated
experimentally in [35]; in particular, the flame is now thinner, and the Mach number of
the flow is $10^{-3}$. Also, a different mesh adaptation procedure is used: we employ here
the line-by-line adaption algorithm of [3] (an example of the adapted mesh, corresponding
to the last but one time level of Figure 4, is shown in Figure 5; as one can see in Figure 4,
this pseudo-one-dimensional adaptive procedure is used only once the flame has reached the
horizontal walls of the rectangular chamber). A second-order accurate fully-coupled
implicit Roe scheme is used in this computation, where the tulip instability is actually
observed, in very good agreement with the experimental results of [35] (see [27]).


Figure 4: (from [27]) Tulip flame instability: flame history (reaction rate contours
at successive time levels).


Figure  5:  (from  [27])  Tulip  flame  instability:  adapted  computational  mesh. 

5.2.  Hypersonic  reactive  flow  with  non  equilibrium  chemistry 

Our last example, taken from [18], concerns an inviscid hypersonic flow around
a double ellipse, with non-equilibrium air chemistry and vibrational equilibrium (i.e.
the equation of state includes vibrational terms). The free-stream Mach number is
equal to 25; we refer to [12], [18] and the references therein for the details. Figure 6
shows the steady-state Mach number contours, computed using a variant of the
first-order accurate multi-component Van Leer scheme of Section 3.2 in which an approach
using an “equivalent-$\gamma$” is employed to define locally the parameter $\gamma$. This calculation
was made with two steady mesh refinements to cluster points in the detached shock
and canopy shock regions. Again, the upwind finite-element method proves to be very
robust, and oscillation-free results are obtained.
Abstract

This paper presents recent developments of the numerical tools used to simulate hypersonic
fluid flows, motivated by the Hermes shuttle project. The approach relies on the use of
unstructured meshes combined with upwind approximations; the treatment of reactive
gas problems is especially emphasized. Numerical results in two and three dimensions
demonstrate the capabilities of the methodology.




1  Introduction 

The design of a new generation of space vehicles such as the advanced space transportation
system (NASP) or the European space shuttle program (HERMES) requires the development
of efficient numerical flow solvers taking into account suitable thermodynamical and chemical
models. During the atmospheric reentry of such vehicles at high velocity and at high
altitude, dissociation, ionization and excitation of the internal energy modes of air have to be
considered. The gas is no longer a perfect gas, and thermal and chemical nonequilibrium models
have to be included in the set of equations. When the characteristic times of these phenomena
are small enough compared to the characteristic time of the fluid motion, all the processes are at
equilibrium with their reverse processes and in this case, one has only to replace the perfect
gas law by a general gas law.

The goal of this paper is to discuss the extension of an Euler flow solver developed in
the past [14] using unstructured meshes to handle equilibrium and/or nonequilibrium reactive
flow simulations. For perfect gas computations, we chose the Osher Riemann solver as upwind
scheme combined with an unstructured grid MUSCL-type approximation [4]. This solver has
proved to be very robust and free of any tuning parameters. Our aim is to keep these properties
for real gas simulations by deriving an adapted generalization of this scheme.

Recently, Abgrall and Montagné [1], Larrouturou and Fézoui [3] have shown that for a
mixture of perfect gas components (conservation of mass of each component is expressed),
the computation of Riemann invariants (which is the key point to evaluate the numerical flux
with the Osher solver) is possible and is a straightforward extension of the single perfect gas
situation. The mixture composition is only changed across the contact discontinuity and remains
constant along the first and third subpaths, as does the ratio of specific heats. Unfortunately,
for a mixture of only thermally perfect gases (when specific heat coefficients are variable and
functions of temperature), no simple analytic expression of Riemann invariants is available.
Nevertheless, it seems reasonable to advocate the use of approximate Riemann solvers; this
approach has been investigated in [1] for chemical equilibrium flows.

This paper first presents a possible generalized Osher Riemann solver for a mixture of
thermally perfect gases under the assumption of nonequilibrium chemistry. Then, a simple
generalized Osher solver is proposed when chemical equilibrium is assumed. The following
section recalls the methodology of the basic Euler solver in which the generalized Riemann
solvers will be included. Numerical experiments in two and three dimensions of hypersonic
flow simulations on adapted unstructured meshes are presented to validate and illustrate the
possibilities of the reactive flow solvers.

2  Chemical  non-equilibrium  reactive  flows 

We consider a gas mixture of N chemical species excluding ionized atoms or molecules and
electrons. In this section, we assume that chemical reactions are in non-equilibrium (what is
called Finite Rate Chemistry) and that the vibrational excitation of molecules is in equilibrium
with the translational one, which means that the system is characterized by a single
temperature. At this point, we make no assumption on the use of possible algebraic equations
(conservation of atoms, of mixing proportions, etc.) to reduce the number of species appearing in
the system of conservation laws. We intend to design an extension of the Osher approximate
Riemann solver to treat such a gas using ideas similar to the approaches referenced in the
introduction.
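2.1 Thermodynamic model

The conservation equations in 1-D are given as follows:

$$\begin{aligned}
&\frac{\partial \rho}{\partial t} + \frac{\partial}{\partial x}(\rho u) = 0 \\
&\frac{\partial}{\partial t}(\rho u) + \frac{\partial}{\partial x}(\rho u^2 + p) = 0 \\
&\frac{\partial e}{\partial t} + \frac{\partial}{\partial x}\big((e + p)\,u\big) = 0 \\
&\frac{\partial}{\partial t}(\rho Y_i) + \frac{\partial}{\partial x}(\rho u Y_i) = \Omega_i , \quad \text{for } i = 1, \dots, N-1
\end{aligned}$$

where $\rho$ denotes the density, $u$ the velocity, $p$ the pressure, $e$ the total energy per unit volume,
$Y_i$, $\Omega_i$ the mass fraction and the production rate of the $i$th species ($i = 1, \dots, N$). We have the
identity $\sum_{i=1}^{N} Y_i = 1$.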


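In terms of the conserved quantities $W = (\rho, m, e, \rho_1, \dots, \rho_{N-1})^T$, the flux can be written as:

$$F(W) = \left( m ,\ \frac{m^2}{\rho} + p ,\ \frac{(e+p)\,m}{\rho} ,\ \frac{\rho_1 m}{\rho} , \dots , \frac{\rho_{N-1} m}{\rho} \right)^T$$

where $m = \rho u$ is the momentum and $\rho_i = \rho Y_i$ is the $i$th species density.

We denote by $\varepsilon$ the specific internal energy and by $\tilde\varepsilon = \rho \varepsilon$ the internal energy per unit volume:

$$e = \frac{1}{2} \rho u^2 + \rho \varepsilon$$

Similarly, let $h$ be the total enthalpy per unit volume and $\mathcal{H}$ the specific enthalpy:

$$h = \frac{1}{2} \rho u^2 + \rho \mathcal{H}$$

The pressure is a function of the density $\rho$, of the internal energy $\tilde\varepsilon$ and of the species densities $\rho_i$:

$$p = p\big(\rho, \tilde\varepsilon, (\rho_i)_{i=1,\dots,N-1}\big)$$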

Considering a mixture of perfect gases, the relation between pressure, density and temperature
is given by:

$$p = \rho\, \mathcal{R}\, T \sum_{i=1}^{N} \frac{Y_i}{M_i} \qquad (1)$$

where $\mathcal{R}$ is the universal gas constant and $M_i$ the molecular weight of species $i$. The
temperature $T$ arising in (1) is then computed from the following nonlinear expression of the enthalpy:

$$\mathcal{H} = \varepsilon + \frac{p}{\rho} = \sum_{i=1}^{N} Y_i h_i^0 + \sum_{i=1}^{N} Y_i\, C_{p_i}(T)\, T$$

or

$$\varepsilon = \sum_{i=1}^{N} Y_i h_i^0 + \sum_{i=1}^{N} Y_i\, C_{v_i}(T)\, T = \sum_{i=1}^{N} Y_i\, \varepsilon_i(T) \qquad (2)$$

where $h_i^0$ is the formation enthalpy at temperature $T_0$, $C_{p_i}$ the specific heat coefficient at
constant pressure and $C_{v_i}$ at constant volume of species $i$.
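Since relation (2) is nonlinear in $T$ when the specific heats depend on temperature, the temperature has to be found iteratively. A minimal sketch using Newton's method is given below; the choice of Newton's method, the function name and the way the functions $C_{v_i}(T)$ and their derivatives are passed in are assumptions made for the illustration, not details taken from the paper.

```python
def temperature_from_energy(eps, Y, h0, cv, dcv_dT,
                            T_guess=300.0, tol=1e-10, max_iter=50):
    """Solve eps = sum_i Y_i * (h0_i + cv_i(T) * T) for T by Newton's method.

    Y, h0: sequences of mass fractions and formation enthalpies;
    cv, dcv_dT: sequences of callables giving C_v,i(T) and its derivative.
    """
    T = T_guess
    for _ in range(max_iter):
        residual = sum(y * (h + c(T) * T) for y, h, c in zip(Y, h0, cv)) - eps
        slope = sum(y * (c(T) + dc(T) * T) for y, c, dc in zip(Y, cv, dcv_dT))
        step = residual / slope
        T -= step
        if abs(step) < tol * max(1.0, abs(T)):
            return T
    raise RuntimeError("Newton iteration did not converge")
```

The derivative used in the iteration, $\sum_i Y_i \, d(C_{v_i}(T)\,T)/dT$, plays the role of a mixture specific heat at constant volume.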

We  will  denote  in  the  sequel: 

*=*£|| 

The Jacobian matrix of the flux with respect to $W$ can be easily computed. This matrix is
diagonalizable with eigenvalues $u - c$, $u$ and $u + c$. The expression of the sound speed is found to be:

*,-2  +  (£7f)!+pi&  (3) 

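This expression can be simplified in the considered context by studying the partial derivatives.
With $\mathcal{R}$ the universal gas constant, we have:

$$p = p\big(\rho, \tilde\varepsilon, (\rho_i)_{i=1,\dots,N-1}\big) = \sum_{i=1}^{N} \rho_i \frac{\mathcal{R} T}{M_i} , \qquad \tilde\varepsilon = \sum_{i=1}^{N} \rho_i\, \varepsilon_i(T)$$

By considering the identity $\rho_N = \rho - \sum_{i=1}^{N-1} \rho_i$, these two equations become:

$$p = \rho \frac{\mathcal{R} T}{M_N} + \mathcal{R} T \sum_{i=1}^{N-1} \rho_i \left( \frac{1}{M_i} - \frac{1}{M_N} \right) , \qquad \tilde\varepsilon = \rho\, \varepsilon_N(T) + \sum_{i=1}^{N-1} \rho_i \big( \varepsilon_i(T) - \varepsilon_N(T) \big) \qquad (4)$$

In what follows, $R$ and $C_v$ denote the mixture quantities defined by $\rho R = \mathcal{R} \sum_{i=1}^{N} \rho_i / M_i$ and
$\rho C_v = \sum_{i=1}^{N} \rho_i\, d\varepsilon_i / dT$. By differentiating both equations (4) with respect to $\rho$, we get:

$$\frac{\partial p}{\partial \rho} = \rho R \frac{\partial T}{\partial \rho} + \frac{\mathcal{R} T}{M_N} , \qquad \frac{\partial T}{\partial \rho} = - \frac{\varepsilon_N(T)}{\rho C_v}$$

By differentiating now both equations (4) with respect to $\tilde\varepsilon$, we obtain:

$$\frac{\partial p}{\partial \tilde\varepsilon} = \rho R \frac{\partial T}{\partial \tilde\varepsilon} , \qquad \frac{\partial T}{\partial \tilde\varepsilon} = \frac{1}{\rho C_v}$$

and with respect to $\rho_j$:

$$\frac{\partial p}{\partial \rho_j} = \rho R \frac{\partial T}{\partial \rho_j} + \mathcal{R} T \left( \frac{1}{M_j} - \frac{1}{M_N} \right) , \qquad \frac{\partial T}{\partial \rho_j} = \frac{\varepsilon_N(T) - \varepsilon_j(T)}{\rho C_v}$$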
If we insert these identities in the expression of the speed of sound (3), we obtain the simplified
equation:

$$c^2 = \gamma \frac{p}{\rho}$$

We then notice that in the case of a chemically reacting gas in non-equilibrium, the expression of
the speed of sound leads to the definition of a coefficient $\gamma$ given by:

This relation can also be obtained for a general multi-temperature thermo-chemical
nonequilibrium system (see [10]). Furthermore, since temperature is a homogeneous function of $W$
of degree one, so is the coefficient $\gamma$.

2.2  Flux  Jacobian  matrix  properties 

In order to simplify our study, we restrict the following analysis to the model that we will use
in the applications. The gas mixture that we are interested in is dissociated air constituted
of five species O, N, NO, O$_2$ and N$_2$ subject to chemical reactions. The mass fractions $(Y_i)_{i=1,\dots,5}$
verify two algebraic relations; the first one is simply due to their definition:

$$\sum_{i=1}^{5} Y_i = 1$$

already used in the previous section. As the air is supposed to be a mixture of 79% nitrogen
and 21% oxygen, the conservation of species gives the second relation:

$$\frac{1}{21} \left( \frac{Y_1}{M_1} + \frac{Y_3}{M_3} + 2 \frac{Y_4}{M_4} \right) = \frac{1}{79} \left( \frac{Y_2}{M_2} + \frac{Y_3}{M_3} + 2 \frac{Y_5}{M_5} \right)$$

where $M_i$ is the molecular weight of the $i$th species.

The composition of the mixture is then entirely determined by the evolution equations of
mass fractions of three of the components. We will keep the equations corresponding to the
three first ones: O, N, NO.
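In practice, the two algebraic relations form a 2×2 linear system for the mass fractions of O$_2$ and N$_2$ once those of O, N and NO are known. A small sketch of this closure follows; the function name is hypothetical and the rounded molar masses are standard textbook values, not numbers taken from the paper.

```python
import numpy as np

# Rounded molar masses in g/mol (standard values, assumed for the example).
M = {"O": 16.0, "N": 14.0, "NO": 30.0, "O2": 32.0, "N2": 28.0}

def close_composition(Y_O, Y_N, Y_NO):
    """Recover (Y_O2, Y_N2) from the two algebraic relations on the mass fractions."""
    # Row 1: Y_O2 + Y_N2 = 1 - Y_O - Y_N - Y_NO
    # Row 2: (2/21) Y_O2/M_O2 - (2/79) Y_N2/M_N2
    #          = (Y_N/M_N + Y_NO/M_NO)/79 - (Y_O/M_O + Y_NO/M_NO)/21
    A = np.array([[1.0, 1.0],
                  [2.0 / (21.0 * M["O2"]), -2.0 / (79.0 * M["N2"])]])
    b = np.array([1.0 - Y_O - Y_N - Y_NO,
                  (Y_N / M["N"] + Y_NO / M["NO"]) / 79.0
                  - (Y_O / M["O"] + Y_NO / M["NO"]) / 21.0])
    Y_O2, Y_N2 = np.linalg.solve(A, b)
    return Y_O2, Y_N2
```

For undissociated air (`close_composition(0, 0, 0)`), this gives back the usual composition of 21% O$_2$ and 79% N$_2$ by mole.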

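The Jacobian matrix of the flux has the following expression:

$$A(W) = \frac{dF(W)}{dW} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
-u^2 + p_\rho & 2u + p_m & p_e & p_{\rho_1} & p_{\rho_2} & p_{\rho_3} \\
u(p_\rho - h) & h + u\,p_m & u(1 + p_e) & u\,p_{\rho_1} & u\,p_{\rho_2} & u\,p_{\rho_3} \\
-u Y_1 & Y_1 & 0 & u & 0 & 0 \\
-u Y_2 & Y_2 & 0 & 0 & u & 0 \\
-u Y_3 & Y_3 & 0 & 0 & 0 & u
\end{pmatrix}$$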

where the partial derivatives of the pressure are given by:

$$p_e = \gamma - 1 ; \qquad p_m = -(\gamma - 1)\,u$$


p„  =  -«  - 1)  + c.,mr  +  gg  (C.,(7-)T)  j  +£&.±t  +  ±t. 
As said in the previous section, this matrix is diagonalizable with three distinct eigenvalues:

$$\lambda_1(W) = u - c , \qquad \lambda_i(W) = u \ \ (i = 2, \dots, 5) , \qquad \lambda_6(W) = u + c$$

where $c$ is the frozen speed of sound given by

The  right  eigenvector  matrix  can  be  written  as 


f  1 

l 

1 

1 

1 

1 

u  -  c 

u 

u 

u 

u 

u  +  c 

R(W)  = 

h  -  uc 

Y i 

k-i 

K 

Yi 

K 

Yi 

*  XJ 
u  — — 

K 

0 

2  XS 

u - — 

K 

0 

h  +  uc 

Vi 

y2 

V, 

0 

V, 

0 

v2 

{  y» 

Vs 

0 

0 

Vs 

Vs 

J 

7  —  1  and 

Xi  =Pp  +  YiP 

for  i  = 

1,2,3. 

2.3  Riemann  invariants 

Since the system of conservation laws is a hyperbolic system, it is important to determine
the nature of each characteristic field (linear degeneracy or genuine nonlinearity). For that, we
compute the sign of the following quantities:

VA»(W).R,(W)  =  ~(7+l) 

VA.fWJRfW)  =  0  >  =  2, . . . ,  5 

VA6(W).Ra(W)  =  + 

We deduce from these quantities that the characteristic fields associated to $\lambda_1$ and $\lambda_6$ are
genuinely non-linear while those associated to the eigenvalues $\lambda_i$ for $i = 2, \dots, 5$ are linearly
degenerate. The variation of the eigenvalue $\lambda_1$ (resp. $\lambda_6$) along the associated path is
monotone. It means that at most one sonic point exists where the sign of $\lambda_1$ (resp. $\lambda_6$) changes.

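A Riemann invariant associated to the $i$th eigenvector is a function $\Psi$ obtained by solving
the following equation:

$$\nabla \Psi(W) \cdot R_i(W) = 0 \qquad (5)$$

So, for the genuinely nonlinear fields, we respectively get:

$$\frac{\partial \Psi}{\partial \rho} + \frac{\partial \Psi}{\partial m}(u - c) + \frac{\partial \Psi}{\partial e}(h - uc) + \sum_{j=1}^{3} \frac{\partial \Psi}{\partial \rho_j} Y_j = 0$$

$$\frac{\partial \Psi}{\partial \rho} + \frac{\partial \Psi}{\partial m}(u + c) + \frac{\partial \Psi}{\partial e}(h + uc) + \sum_{j=1}^{3} \frac{\partial \Psi}{\partial \rho_j} Y_j = 0$$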


For $i = 1$ or $i = 6$, the mass fractions $Y_j$ are Riemann invariants, which can be easily verified.
For $i = 2, \dots, 5$, the velocity $u$ and pressure $p$ are still Riemann invariants as in the perfect
gas case. The other invariants have no a priori analytic expression in the considered
context. In the case where the specific heat coefficients $C_{v_i}$ (and then $C_{p_i}$) are constant (do
not depend on temperature), Abgrall and Montagné [1], Fezoui and Larrouturou [8] have
shown that the invariants can be computed and are the classic ones with the parameter $\gamma$ as
defined previously. Consequently, we will formally use in our case (where the specific heat
coefficients are functions of temperature) these expressions of Riemann invariants, which are now
only approximate Riemann invariants. This choice is of course not at all unique but provides
simple expressions that will be used to define an extension of the Osher Riemann solver. This
approximation does not affect the consistency of the scheme. The expressions of the
Abgrall-Montagné Riemann invariants are given in Table 1.


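Eigenvalue $u - c$:   $\Phi^1_1 = p/\rho^{\gamma}$,   $\Phi^1_2 = u + \frac{2}{\gamma-1}\,c$

Eigenvalue $u$:       $\Phi^2_1 = u$,   $\Phi^2_2 = p$

Eigenvalue $u + c$:   $\Phi^3_1 = p/\rho^{\gamma}$,   $\Phi^3_2 = u - \frac{2}{\gamma-1}\,c$

Table 1: Riemann invariants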


2.4  Extension  of  the  Osher  Riemann  solver 

To compute the flux separating two states $W_l$ and $W_r$, Osher et al. [11] proposed a
numerical flux function that can be written in condensed form as

$$\Phi(W_l, W_r) = \frac{1}{2}\left[F(W_l) + F(W_r)\right] - \frac{1}{2}\int_{W_l}^{W_r} |A(W)|\, dW$$

where $|A| = A^+ - A^-$.
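To make the structure of this flux concrete, here is a minimal sketch for a scalar model problem rather than the reactive Euler system: Burgers' equation with $f(w) = w^2/2$, for which $|A(w)| = |w|$ and the only sonic point is $w = 0$. The function names are illustrative assumptions and do not come from the paper.

```python
def burgers_flux(w: float) -> float:
    """Physical flux f(w) = w^2 / 2 of the scalar model problem."""
    return 0.5 * w * w


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def abs_jacobian_integral(a: float, b: float) -> float:
    """Integral of |f'(w)| dw from a to b, split at the sonic point w = 0.

    On each piece the sign of f'(w) is constant, so the integral reduces to
    sign * (flux difference), as in the sub-path formulas of the text.
    """
    if a * b < 0.0:
        sonic = 0.0
        return (sign(a) * (burgers_flux(sonic) - burgers_flux(a))
                + sign(b) * (burgers_flux(b) - burgers_flux(sonic)))
    s = sign(a) if a != 0.0 else sign(b)
    return s * (burgers_flux(b) - burgers_flux(a))


def osher_flux_burgers(wl: float, wr: float) -> float:
    """Osher-type numerical flux: centered average minus half the |A| integral."""
    centered = 0.5 * (burgers_flux(wl) + burgers_flux(wr))
    return centered - 0.5 * abs_jacobian_integral(wl, wr)
```

When both states move to the right the flux reduces to $f(w_l)$, and for a transonic rarefaction ($w_l < 0 < w_r$) it returns the sonic value $f(0) = 0$.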

Instead of using the exact path of the Riemann problem between $W_l$ and $W_r$, Osher proposed
a path consisting of two rarefaction waves (corresponding to the two genuinely nonlinear fields)
and a contact discontinuity (corresponding to the linearly degenerate fields). To this end, the
integration path between the two states $W_l$ and $W_r$ is decomposed into three simple sub-paths

$$\Gamma = \Gamma_1 \cup \Gamma_2 \cup \Gamma_3$$

the direction vectors of $\Gamma_k$ belonging to the eigenspace associated with the eigenvalue
$\lambda_k$. The eigenvalues can be ordered in two ways: as $u-c$, $u$, $u+c$ (natural order) or
as $u+c$, $u$, $u-c$ (the order adopted by Osher et al.). We use the latter hereafter, which leads
to the following parametrization:


•  The first sub-path $\Gamma_1$, corresponding to a rarefaction wave, is defined by

$$\frac{dW(s)}{ds} = R_6(W(s)).$$

•  The second one, $\Gamma_2$, is defined by

$$\frac{dW(s)}{ds} = \sum_{i=2}^{5} \alpha_i\, R_i(W(s)).$$

•  The last one, $\Gamma_3$, also corresponding to a rarefaction wave, is defined by

$$\frac{dW(s)}{ds} = R_1(W(s)).$$

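We denote by $W_{1/3}$ and $W_{2/3}$ the intersection states of the sub-paths (respectively the
intersection of $\Gamma_1$ and $\Gamma_2$ and the intersection of $\Gamma_2$ and $\Gamma_3$) and
by $\bar W_1$ and $\bar W_3$ the two sonic points, i.e. the points of $\Gamma_1$ (resp.
$\Gamma_3$) where $\lambda_6$ (resp. $\lambda_1$) is equal to zero. The integration over the
sub-paths can be given in a simple form as follows.

•  Along $\Gamma_2$, the eigenvalue $\lambda_2 = u$ remains constant, equal to $u_{1/3}$. The
corresponding integral is independent of the choice of the parameters $\alpha_i$ and reads:

$$\int_{\Gamma_2} |A(W)|\, dW = \int_{W_{1/3}}^{W_{2/3}} |A(W)|\, dW = \mathrm{sign}(\lambda_2(W_{1/3}))\left[F(W_{2/3}) - F(W_{1/3})\right]$$

•  The contribution on sub-path $\Gamma_1$ is given by:

$$\int_{\Gamma_1} |A(W)|\, dW = \mathrm{sign}(\lambda_6(W_l))\left[F(\bar W_1) - F(W_l)\right] + \mathrm{sign}(\lambda_6(W_{1/3}))\left[F(W_{1/3}) - F(\bar W_1)\right]$$

•  The contribution on sub-path $\Gamma_3$ is given by:

$$\int_{\Gamma_3} |A(W)|\, dW = \mathrm{sign}(\lambda_1(W_{2/3}))\left[F(\bar W_3) - F(W_{2/3})\right] + \mathrm{sign}(\lambda_1(W_r))\left[F(W_r) - F(\bar W_3)\right]$$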


The integral in the numerical flux function is at this point completely defined. It remains
to specify how to compute the states $W_{1/3}$, $W_{2/3}$, $\bar W_1$ and $\bar W_3$. For that,
we use the approximate Riemann invariants $\Phi^{1,2,3}$ defined previously and given in Table 1;
since they are assumed to satisfy (5), they remain constant on their corresponding sub-path
$\Gamma_k$. The following relations are found:


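•  For the wave corresponding to $u + c$:

$$\frac{p_l}{\rho_l^{\gamma_l}} = \frac{p_{1/3}}{\rho_{1/3}^{\gamma_l}}, \qquad Y_l = Y_{1/3}, \qquad u_l - \frac{2}{\gamma_l-1}\,c_l = u_{1/3} - \frac{2}{\gamma_l-1}\,c_{1/3} \qquad (6)$$

•  For the wave corresponding to $u - c$:

$$\frac{p_r}{\rho_r^{\gamma_r}} = \frac{p_{2/3}}{\rho_{2/3}^{\gamma_r}}, \qquad Y_r = Y_{2/3}, \qquad u_r + \frac{2}{\gamma_r-1}\,c_r = u_{2/3} + \frac{2}{\gamma_r-1}\,c_{2/3} \qquad (7)$$

For the contact discontinuity, velocity and pressure remain constant, which gives the
following identities:

$$u_{1/3} = u_{2/3}, \qquad p_{1/3} = p_{2/3} \qquad (8)$$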


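We finally obtain the following nonlinear scalar equation:

$$\frac{2c_l}{\gamma_l-1}\,x + \frac{2c_r}{\gamma_r-1}\left(\frac{p_l}{p_r}\right)^{\frac{\gamma_r-1}{2\gamma_r}} x^{\alpha} = u_r - u_l + \frac{2c_r}{\gamma_r-1} + \frac{2c_l}{\gamma_l-1}$$

where

$$x = \left(\frac{p_{1/3}}{p_l}\right)^{\frac{\gamma_l-1}{2\gamma_l}}, \qquad \alpha = \frac{\gamma_l(\gamma_r-1)}{\gamma_r(\gamma_l-1)}.$$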

This equation has already been discussed in the previous work of Abgrall and Montagné
[1], who analyzed suitable solution methods. We solve it by a bisection (dichotomy) method.
The other unknowns are simple functions of $x$:


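$$p_{1/3} = p_l\, x^{\frac{2\gamma_l}{\gamma_l-1}}, \qquad \rho_{1/3} = \rho_l\, x^{\frac{2}{\gamma_l-1}}, \qquad u_{1/3} = u_l - \frac{2c_l}{\gamma_l-1}\,(1 - x)$$

$$p_{2/3} = p_{1/3}, \qquad \rho_{2/3} = \rho_r \left(\frac{p_{2/3}}{p_r}\right)^{1/\gamma_r}, \qquad u_{2/3} = u_{1/3}$$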
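The dichotomy step can be sketched as follows. This is only an illustration under stated assumptions: the scalar equation is taken in the form obtained by combining relations (6), (7) and (8) with $c \propto p^{(\gamma-1)/(2\gamma)}$ along each rarefaction sub-path, that is $a_l x + a_r k\, x^{\alpha} = u_r - u_l + a_l + a_r$ with $a = 2c/(\gamma-1)$, $k = (p_l/p_r)^{(\gamma_r-1)/(2\gamma_r)}$, $\alpha = \gamma_l(\gamma_r-1)/(\gamma_r(\gamma_l-1))$ and $x = c_{1/3}/c_l$. The `State` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class State:
    rho: float    # density
    u: float      # velocity
    p: float      # pressure
    c: float      # speed of sound
    gamma: float  # frozen "gamma" attached to this state


def solve_x(left: State, right: State, tol: float = 1e-12, max_iter: int = 200) -> float:
    """Solve a_l*x + a_r*k*x**alpha = rhs for x > 0 by bisection."""
    a_l = 2.0 * left.c / (left.gamma - 1.0)
    a_r = 2.0 * right.c / (right.gamma - 1.0)
    alpha = left.gamma * (right.gamma - 1.0) / (right.gamma * (left.gamma - 1.0))
    k = (left.p / right.p) ** ((right.gamma - 1.0) / (2.0 * right.gamma))
    rhs = right.u - left.u + a_l + a_r
    if rhs <= 0.0:
        raise ValueError("no positive root: the two rarefactions create a vacuum")

    def g(x: float) -> float:
        return a_l * x + a_r * k * x ** alpha - rhs

    lo, hi = 0.0, 1.0          # g(0) = -rhs < 0 and g is increasing in x
    while g(hi) < 0.0:         # enlarge the bracket until the sign changes
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def intermediate_states(left: State, right: State):
    """Return (rho, u, p) at the intersection states W_1/3 and W_2/3."""
    x = solve_x(left, right)
    gl, gr = left.gamma, right.gamma
    p13 = left.p * x ** (2.0 * gl / (gl - 1.0))
    rho13 = left.rho * x ** (2.0 / (gl - 1.0))
    u13 = left.u - 2.0 * left.c / (gl - 1.0) * (1.0 - x)
    p23, u23 = p13, u13                       # contact: u and p are continuous
    rho23 = right.rho * (p23 / right.p) ** (1.0 / gr)
    return (rho13, u13, p13), (rho23, u23, p23)
```

Because the left-hand side is increasing in $x$ and negative at $x = 0$, a sign-change bracket always exists when the right-hand side is positive, which is what makes bisection a safe choice here.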

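Sonic points are determined by expressing that the corresponding eigenvalue is equal to zero
and that the Riemann invariants remain constant on the sub-path:

$$\bar u_1 + \bar c_1 = 0, \qquad \bar u_1 - \frac{2}{\gamma_l-1}\,\bar c_1 = u_l - \frac{2}{\gamma_l-1}\,c_l, \qquad \frac{\bar p_1}{\bar\rho_1^{\gamma_l}} = \frac{p_l}{\rho_l^{\gamma_l}}$$

$$\bar u_3 - \bar c_3 = 0, \qquad \bar u_3 + \frac{2}{\gamma_r-1}\,\bar c_3 = u_r + \frac{2}{\gamma_r-1}\,c_r, \qquad \frac{\bar p_3}{\bar\rho_3^{\gamma_r}} = \frac{p_r}{\rho_r^{\gamma_r}}$$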
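As a worked step, the sonic-point conditions can be solved in closed form. The derivation below is a sketch that assumes the frozen-$\gamma$ relation $c^2 = \gamma p/\rho$ on each sub-path, so that $c \propto \rho^{(\gamma-1)/2}$ when $p/\rho^{\gamma}$ is constant; the overbar notation for the sonic states is an assumption of this sketch.

```latex
% Sonic point on the u+c sub-path: \bar u_1 = -\bar c_1, substituted into the
% invariant u - 2c/(\gamma_l - 1) = const:
\bar c_1 = \frac{2 c_l - (\gamma_l - 1)\, u_l}{\gamma_l + 1}, \qquad
\bar u_1 = -\bar c_1, \qquad
\bar\rho_1 = \rho_l \left(\frac{\bar c_1}{c_l}\right)^{\frac{2}{\gamma_l - 1}}, \qquad
\bar p_1 = p_l \left(\frac{\bar c_1}{c_l}\right)^{\frac{2\gamma_l}{\gamma_l - 1}}

% Sonic point on the u-c sub-path: \bar u_3 = \bar c_3, substituted into the
% invariant u + 2c/(\gamma_r - 1) = const:
\bar c_3 = \frac{2 c_r + (\gamma_r - 1)\, u_r}{\gamma_r + 1}, \qquad
\bar u_3 = \bar c_3, \qquad
\bar\rho_3 = \rho_r \left(\frac{\bar c_3}{c_r}\right)^{\frac{2}{\gamma_r - 1}}, \qquad
\bar p_3 = p_r \left(\frac{\bar c_3}{c_r}\right)^{\frac{2\gamma_r}{\gamma_r - 1}}
```

A sonic point is only used when the eigenvalue actually changes sign between the two ends of the sub-path; otherwise the corresponding integral reduces to a single flux difference.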


We notice that the intermediate and sonic points are determined through density, velocity,
pressure and mass fractions. The temperature is then computed from the equation of state. The
last quantity needed to compute the corresponding flux is the energy, which is determined by
equation (2).


3  Chemical  equilibrium  reactive  flows 


In applications, it is of interest to consider the limit case of local chemical equilibrium.
For many applications concerning the computation of reentry flow problems, this approximation
is valid. In this case, the mass fractions are obtained by assuming equilibrium, $\Omega(W) = 0$.
We then have:

$$Y_i = Y_i(\rho, \rho\varepsilon) \;\Rightarrow\; p = p(\rho, \rho\varepsilon)$$

In order to connect the variables $p$ and $c$ to the independent variables $\rho$ and
$\varepsilon$, it is common to introduce the nondimensional variables


et  7 


We still have the relation

$$p = (\gamma - 1)\,\rho\varepsilon$$

This coefficient $\gamma$ is not equal to the ratio of specific heats of the gas. The speed of
sound $c$ involved in the expressions of the eigenvalues $u - c$, $u$, $u + c$ of the Jacobian
matrix of the system is now given by the differential formula:


•■-3+ (4*)  8 


As in the nonequilibrium case, we intend to define an extension of the Osher Riemann
solver. We now describe our approach in this case.

The field associated with the second eigenvalue is still linearly degenerate, with Riemann
invariants $u$ and $p$. As in the previous section, we do not have analytic expressions for the
Riemann invariants of the first and third characteristic fields. To evaluate the integrals of
the Osher solver, we propose to compute them approximately in order to define the path of
integration as previously. We choose to compute the Riemann invariants as the ones given in
Table 1, with the parameter $\gamma$ taken equal to $(\gamma(W_l) + \gamma(W_r))/2$. No nonlinear
equation has to be solved in this case. The intersection and sonic points of the sub-paths are
known through density, velocity and pressure. In order to compute the corresponding energy, we
assume that the parameter $\gamma$ is constant along the first and third sub-paths. With this
methodology, the chemistry routine which gives the pressure and the speed of sound for one pair
of data $(\rho, \rho\varepsilon)$ is called only twice at each flux evaluation, i.e. for the
states $W_l$ and $W_r$.

4  Spatial Approximation

We present in this section the main features of a high-order approximation of the
two-dimensional Euler equations relying on an upwind formulation on an unstructured mesh, in
conjunction with Total Variation Diminishing (TVD) properties. The extension to the
three-dimensional equations is straightforward.

Let $T_h$ be a triangulation of the computational domain $D$ with boundary $\Gamma$. We can
write the Euler system in conservative form as:

$$\frac{\partial W}{\partial t} + \nabla\cdot F(W) = \Omega(W)$$

The complete formulation can be found in [4] and is based on a Green formula:

Find $W_h \in (V_h)^m$ such that, for every vertex $N_i \in T_h$,

$$R_i = -\int_{\partial C_i} F(W_h)\cdot n_i\, d\sigma - \int_{\partial C_i\cap\Gamma} F(W_h)\cdot n_\Gamma\, d\sigma \qquad (9)$$

where $V_h = \{\, v_h \in C^0(D)\ ;\ v_h \text{ is linear on each triangle} \,\}$ and $R_i$ denotes
the residual. The cell $C_i$ is defined for each vertex $N_i \in T_h$ as the union of the
subtriangles which have $N_i$ as a vertex and result from the subdivision of each triangle of
$T_h$ by means of the medians, as shown in Figure 1. The vectors $n_i$ and $n_\Gamma$ denote the
outward normals of the cell $C_i$ and of the domain boundary $\Gamma$, respectively.

The scheme is completely defined once we specify which approximation is used to compute the
left-hand integral in (9). To this end, the boundary $\partial C_i$ of the cell $C_i$ is split
into panels $\partial S_{ij}$ joining the midpoint of the segment $[N_i, N_j]$ to the centroids
of the triangles having $N_i$ and $N_j$ as common vertices.

We introduce the following notation:

$$F_{ij}(W_h) = F(W_h)\cdot\int_{\partial S_{ij}} n_i\, d\sigma \qquad\text{and}\qquad P_{ij}(W_h) = \frac{\partial F}{\partial W}(W_h)\cdot\int_{\partial S_{ij}} n_i\, d\sigma$$

Upwinding is introduced in the computation of the convection term through the numerical
flux function $\Phi$ of a first-order accurate upwind scheme:

$$\int_{\partial S_{ij}} F(W_h)\cdot n_i\, d\sigma = H^{(1)}_{ij} = \Phi_{F_{ij}}(W_i, W_j)$$

where $W_i = W_h(N_i)$ and $W_j = W_h(N_j)$.

Figure 1: cell $C_i$


The numerical flux function used in this scheme is Osher's approximate Riemann solver [11]
in the nonreactive case, or its extension as described in Sections 2 and 3 when chemical
effects are considered; this scheme has been chosen for its robustness and its parameter-free
implementation. The numerical integration with the upwind scheme described previously leads to
approximations which are only first-order accurate. A second-order accurate MUSCL-like [9]
extension can be defined without changing the approximation space:


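Find $W_h \in (V_h)^m$ such that

$$\int_{C_i} \frac{\partial W_h}{\partial t}\, dx + \sum_{j\in K(i)} H^{(2)}_{ij} + \int_{\partial C_i\cap\Gamma} F(W_h)\cdot n_\Gamma\, d\sigma = \int_{C_i} \Omega(W_h)\, dx \qquad (10)$$

where $K(i)$ is the set of neighbours of vertex $N_i$, and $H^{(2)}_{ij} = \Phi_{F_{ij}}(W_{ij}, W_{ji})$.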

The arguments $W_{ij}$ and $W_{ji}$ are values at the interface $\partial S_{ij}$ which have
been interpolated by using upwind gradients, as described below.

We define the downstream and upstream triangles $T_{ij}$ and $T_{ji}$ for each segment
$[N_i, N_j]$ as shown in Figure 2. Let the centered gradient be $\nabla W_{ij} = \nabla W|_{T}$,
where $T$ is one of the triangles having $N_i$ and $N_j$ as vertices.

A good procedure in terms of accuracy is to apply limiters to the characteristic variables. We
compute these variables using the transformation evaluated at the midpoint of the segment. If we
denote by $\Pi_{ij}$ the transformation matrix corresponding to $P_{ij}((W_i + W_j)/2)$, then the
values at the interface needed to compute the flux are given by:

W,.,  =  W(i+rL,  Lc„  nr*  (i^vw,  Ir...  +  ^vw(,;)  •  Nj^i 


where $L_{ij}$ and $L_{ji}$ are the diagonal limiting matrices introduced to reduce numerical
oscillations of the solution and to provide some kind of monotonicity property. In all
computations, we use the Van Albada limiter [2] associated with the Fromm scheme, corresponding
to $\kappa = 0$, which combines a certain monotonicity property with second-order accuracy [13].
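For illustration, a scalar version of the Van Albada average can be sketched as below. This is one commonly used form of the limiter and not necessarily the exact variant implemented by the authors; the function names and the small regularization `eps` are assumptions.

```python
def van_albada(a: float, b: float, eps: float = 1e-12) -> float:
    """Van Albada average of two slopes; returns 0 at local extrema."""
    if a * b <= 0.0:
        return 0.0
    return ((a * a + eps) * b + (b * b + eps) * a) / (a * a + b * b + 2.0 * eps)


def limited_interface_value(w_i: float, d_upwind: float, d_centered: float) -> float:
    """Extrapolate w_i to the interface with a limited increment.

    d_upwind and d_centered are the upwind and centered increments of w over
    the segment [N_i, N_j]; half of the limited increment reaches the midpoint.
    """
    return w_i + 0.5 * van_albada(d_upwind, d_centered)
```

When the two increments agree (smooth, locally linear data) the limiter returns that common increment, so the midpoint value is recovered exactly; when they have opposite signs the scheme falls back to the first-order value `w_i`.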
Figure 2: downstream and upstream triangles for the segment $[N_i, N_j]$


Boundary  conditions 

Boundary integrals over $\Gamma$ in (10) are computed so as to take into account the physical
boundary conditions.

•  At  inflow  and  outflow  boundaries,  the  integral  is  evaluated  with  a  flux-splitting  applied 
between  exterior  data  and  interior  values  of  the  solution. 


•  On  the  body,  the  slip  condition  is  prescribed;  the  boundary  integral  can  then  be  written 
as  a  pressure  integral  term: 


f  F(W)-n r  <fcr=  f 

Jac^n  r  1  sc,,  nr 


/  0  \ 
Pi  nr 
0 

V  Oj^Af  J 


da 


where $N_i$ is a node on the body and $p_i$ has to be determined.

In terms of the Riemann problem, the boundary problem is under-determined, since only the
left state (taken equal to $W_i$) and the wall speed are known. Nevertheless, it is easy to
verify that there is no 3-wave and that the contact discontinuity coincides with the
wall [6]. The 1-wave can be either a shock ($u_i \cdot n_\Gamma > 0$) or a rarefaction
($u_i \cdot n_\Gamma < 0$) wave.

For perfect gases, the so-called half-Riemann problem can be solved analytically,
which allows the use of the Godunov flux.


To take reactive effects into account, we use the modified Osher flux (§2.4), since the
appearance of a 1-shock wave would lead to a nonlinear equation to be solved. Using
formula (7), we obtain:


Pi  =  Pi 


it- 1 
2 


u  •  nr 

e.  . 


5  Numerical  treatment  of  source  terms 

Source terms $\Omega_i$ are evaluated according to the model of 17 chemical reactions
(dissociation and exchange reactions) given in [12] and prescribed as the basic model at the
Hermes Workshop.

The source terms are treated implicitly in order to remove the restrictive time-step
limitations due to the chemical part. The scheme is written as

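$$\frac{W_i^{n+1} - W_i^n}{\Delta t} = -\frac{1}{\mathrm{area}(C_i)} \sum_{j\in K(i)} H_{ij}^n + \Omega(W_i^{n+1})$$

As in Désidéri et al. [3], the term $\Omega(W_i^{n+1})$ is linearized in the following manner:

$$\Omega(W_i^{n+1}) = \Omega(W_i^n) + \frac{\partial\Omega}{\partial W}(W_i^n)\,(W_i^{n+1} - W_i^n)$$

which gives

$$\left(\frac{I}{\Delta t} - \frac{\partial\Omega}{\partial W}(W_i^n)\right)\delta W_i = -\frac{1}{\mathrm{area}(C_i)} \sum_{j\in K(i)} H_{ij}^n + \Omega(W_i^n), \qquad \delta W_i = W_i^{n+1} - W_i^n.$$

This formulation leads to the solution of a 3 × 3 system.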
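The linearized (point-implicit) treatment of the source term can be sketched as follows for a single node. This is a minimal illustration, not the authors' code: `conv_rhs` stands for the explicit convective contribution at the node, and the callables `source` and `source_jac` are hypothetical stand-ins for the chemistry model and its Jacobian.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = s / M[k][k]
    return x


def point_implicit_update(w, conv_rhs, source, source_jac, dt):
    """One step of (I/dt - dOmega/dW) dW = conv_rhs + Omega(W^n)."""
    n = len(w)
    omega = source(w)
    jac = source_jac(w)
    A = [[(1.0 / dt if i == j else 0.0) - jac[i][j] for j in range(n)]
         for i in range(n)]
    rhs = [conv_rhs[i] + omega[i] for i in range(n)]
    dw = solve_linear(A, rhs)
    return [w[i] + dw[i] for i in range(n)]
```

For a linear source $\Omega(W) = -kW$ with no convective contribution, this update reduces to backward Euler, $W^{n+1} = W^n/(1 + k\,\Delta t)$, which stays bounded however stiff $k$ is; that is the motivation for treating the chemistry implicitly.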


6  Numerical  results 

We present a set of typical computations obtained by applying the methodology detailed above,
to illustrate the capabilities of the solver for both equilibrium and nonequilibrium real gas
simulations. For equilibrium flow simulations, we use the tabulated thermochemical model
developed by Vaneamberg [15]. All the computations have been performed with an explicit
four-stage time-stepping scheme, which allows the use of a Courant number of 1.8.

The equilibrium flow solver is going to be validated in 2D during the Workshop [5] with
computations around a double ellipse, so we only present three-dimensional experiments here.
The first result is the computation of the inviscid flow, under the local chemical equilibrium
assumption, around a forebody of the European space shuttle Hermes, using the methodology
described above, at a freestream Mach number of 10 and an angle of attack of 30°. The
iso-Mach number lines in the symmetry plane are displayed in Figures 3 and 4. The two pictures
in each Figure correspond respectively to a computation on an initial coarse grid and on a final
one obtained after 4 successive adaptive refinements. In Figure 4, we can clearly see that
the canopy shock is well captured on the final mesh. This shows that refinement is mandatory
for such computations in order to obtain accurate solutions on reasonable grids.

Next, a computation of equilibrium flow around the complete Hermes using the same methodology
is presented, corresponding to a freestream Mach number of 25, an angle of attack of 30° and
an altitude of 76000 meters. Two adaptive mesh refinement procedures have been applied
successively, based on a criterion related to the gradient of the Mach number. The initial
surface mesh is presented in Figure 5. The first mesh is made of 13770 nodes and 74659 elements
and the final one of 27338 nodes and 146343 elements. Surface iso-Mach number lines are shown in
Figure 6 and iso-Mach number lines in the vertical symmetry plane in Figure 7, for the solution
obtained on the final mesh. A comparison of the solutions obtained on both meshes is displayed
in Figures 8 and 9 through the iso-Mach number lines in a horizontal plane crossing the shuttle
winglets. One can easily see that adaptive mesh refinement has been active in the winglet
regions and in the shock capture.

The validation of the equilibrium reactive gas solver has been performed by a numerical
simulation of hypersonic flows around an Aeroassisted Orbital Transfer Vehicle geometry, for
which an accurate numerical prediction of the aerodynamic moments is required for stability.
The global final adapted mesh consists of 9041 nodes and 45817 elements, obtained after two
refinement procedures, for a computation at a Mach number of 10, zero angle of attack and
75000 meters altitude. Figure 10 presents the iso-Mach number lines in the symmetry plane and
Figure 11 the pressure coefficient lines. The maximum temperature at the stagnation point is
equal to 4246 K.

We then present results concerning nonequilibrium chemical flows. The thermodynamic model
considered in these applications is the one proposed in the hypersonic Workshop [5], where
specific heat coefficients are given for molecules by

First, a validation of the solver is performed on a shock tube problem proposed by Montagné,
defined by two states at rest corresponding to a density of 0.066 and a temperature of 4390 K
on the left, and a density of 0.030 and a temperature of 1378 K on the right. Results concerning
density, Mach number, pressure and temperature are presented in Figure 12. No oscillations
appear and the solution obtained is monotone. There is still some numerical dissipation when
compared with other available results, revealed by the insufficient sharpness of the rarefaction
wave. This is certainly due to the limiting procedure, which is applied to the primitive
variables and not to the characteristic ones. The pressure remains constant across the contact
discontinuity without any oscillation, as it should.

The computation of the hypersonic flow around a double ellipse has been performed with
the nonequilibrium flow solver. The case considered is the one proposed in the aforementioned
Workshop [5], at a freestream Mach number of 25 and an angle of attack of 30°. The
two-dimensional mesh, made of 4257 nodes, is shown in Figure 13. Iso-Mach number lines are
presented in Figure 14 ($\Delta M = 0.25$). The two shocks are well captured, but the spreading
of the upper part of the bow shock indicates that adaptive refinement is needed in this
region. The next Figures (15, 16 and 17) show the wall values of the mass fractions of the
species produced by chemical reactions, respectively NO, N and O. We notice that along the wall
the amounts of O and NO are nearly constant. The mass fraction of N decreases along the
double ellipse wall and is weakly affected by the canopy shock. Wall values of temperature,
pressure coefficient and Mach number are displayed respectively in Figures 18, 19 and 20. The
temperature at the stagnation point is less than 10000 K. These results are in good agreement
with other available results, for example in [7]; this is quite satisfactory because the mesh
employed is undeniably coarse.

7  Conclusion 

This paper has presented an extension of an inviscid perfect gas flow solver to take into
account reactive real gases. Emphasis has been put on the capacity of the method to compute
hypersonic flows around 2D and 3D geometries. We have not addressed the efficiency of the
solvers here; the development of implicit versions is under investigation.
Abstract

Various techniques for implementing upwind flux-split schemes for the Euler and
Navier-Stokes equations on unstructured meshes are reviewed. The development of a
space-marching technique on hybrid structured/unstructured meshes is presented. In addition,
time integration algorithms on unstructured grids, with an emphasis on convergence
acceleration to the steady state, are compared. An m-stage Jameson-style explicit Runge-Kutta
scheme is used as a baseline comparison. Implicit schemes discussed include a highly
vectorizable skyline sparse matrix solver, hybrid explicit/implicit time advancement schemes,
and various relaxation strategies. Mesh adaptation techniques are also discussed. Results
in both two and three dimensions are presented, including a supersonic inlet calculation
with complex wave interactions and a space-marching, inviscid simulation on an unstructured
mesh about a high-speed reconnaissance aircraft.


Nomenclature 


$a$                    speed of sound
$c$                    mass concentration
$D_{12}$               binary diffusion coefficient
$e$                    internal energy per unit mass
$\tilde e$             equilibrium portion of internal energy
$e_n$                  nonequilibrium portion of internal energy
$e_0$                  total internal energy per unit mass
$\dot e_n$             nonequilibrium energy production rate
$F, G, H$              inviscid flux vectors
$F_v, G_v, H_v$        viscous flux vectors
$h$                    enthalpy per unit mass
$h_0$                  total enthalpy per unit mass
$I$                    identity matrix
$J$                    Jacobian of coordinate transformation
$k$                    thermal conductivity
$L, U$                 LU decomposition matrices
$p$                    pressure
$q$                    velocity magnitude
$Q$                    vector of conserved variables
$R$                    residual for time integration algorithms
$t$                    time
$T$                    temperature
$u, v, w$              cartesian components of velocity
$\bar u, \bar v, \bar w$   contravariant components of velocity
$V$                    cell volume
$\dot w$               chemical production rate
$W$                    vector of production rates
$\alpha$               weighting coefficient
$\nabla(*)$            gradient(*)
$\Delta$               finite difference operator
$\mu$                  dynamic viscosity
$\xi, \eta, \zeta$     generalized space coordinates
$\hat\xi, \hat\eta, \hat\zeta$   direction cosines
$\rho$                 density


Introduction 

There has been a clear trend in recent years toward the development of algorithms for
computational fluid dynamics (CFD) simulations that have significant flexibility in
modeling problems with complex geometries and/or complex physics. This has led several
researchers away from structured (or logical) indexing schemes for addressing mesh
elements to generalized indexing schemes, frequently referred to as unstructured techniques.

This paper discusses some of the research that has taken place during the past two
years at VPI&SU concerning algorithm development on unstructured and hybrid
(structured/unstructured) grids for compressible flow simulations. Three primary areas have
been investigated and will be briefly discussed. They are: 1) implicit time integration
schemes, 2) mesh adaptation, and 3) space-marching methods on hybrid grids. The work
presented herein has been heavily influenced by contributions from many researchers
including, but not limited to, A. Jameson, R. Löhner, P. Roe, B. Van Leer, T. Barth,
B. Stoufflet, B. Grossman, and J. L. Thomas and their co-workers. In the authors'
opinion, this work represents a combination of some of the best ideas presented by these
people.

The following sections provide a brief discussion of the governing equations considered,
along with a description of cell-vertex and cell-centered spatial discretizations. Various
time integration schemes are presented and compared for a simple transonic test problem.
Results with and without mesh adaptation for a supersonic inlet are shown. Finally, results
from a novel space-marching method applied to a high-speed reconnaissance aircraft are
discussed.
Governing  Equations

The equations of motion of interest in this paper are the full Navier-Stokes (FNS)
equations and subsets thereof, including the Thin-Layer Navier-Stokes (TLNS), Parabolized
Navier-Stokes (PNS), and Euler equations. The integral form of these equations may
be written in the common form:

$$\frac{\partial}{\partial t}\iiint_V Q\, dV + \iint_S \vec F\cdot\hat n\, dS = \iiint_V W\, dV \qquad (1)$$

where $Q$ is the vector of dependent variables, $W$ is a source term, and $\vec F\cdot\hat n$
represents the flux of mass, momentum, and energy out of the control volume $V$ through the
surface $S$, with $\hat n$ an outward unit normal vector from $S$. The algorithms discussed here
always directly discretize the integral form of the governing equations, but it is frequently
convenient for discussion purposes to rewrite (1) in the differential form


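$$\frac{\partial Q}{\partial t} + \frac{\partial (F - F_v)}{\partial \xi} + \frac{\partial (G - G_v)}{\partial \eta} + \frac{\partial (H - H_v)}{\partial \zeta} = W \qquad (2a)$$

where

$$Q = \left(\rho_1,\ \rho_2,\ \ldots,\ \rho_N,\ \rho u,\ \rho v,\ \rho w,\ \rho_1 e_{n_1},\ \ldots,\ \rho_M e_{n_M},\ \rho e_0\right)^T,$$

$$W = \left(\dot w_1,\ \dot w_2,\ \ldots,\ \dot w_N,\ 0,\ 0,\ 0,\ \rho_1 \dot e_{n_1} + e_{n_1}\dot w_1,\ \ldots,\ \rho_M \dot e_{n_M} + e_{n_M}\dot w_M,\ 0\right)^T \qquad (2b)$$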


For generality, a source term, $W$, due to chemical reactions and nonequilibrium
thermodynamics has been included. A thorough discussion of this formulation can be found in [8].
The common perfect gas form is a simple subset of this more general case. The vectors
$F$, $G$, $H$ represent the inviscid and pressure terms, and $F_v$, $G_v$, $H_v$ contain the
shear stress and heat flux contributions. As an example, for an N-species, finite-rate,
chemically reacting flow in which M of the species are considered to be in vibrational
nonequilibrium, one may write


where 


/  Pii  \ 

(n 

/  pDu^ii  \ 

pD\2C2(k 

| >p 
it 

puu  + 

PVU  +  lyP 
pwu  +  £.-p 

«*> 

ii 

.  < 

^  fy> 

to 

pDuCNi 
put  +  putix/ 3 
pvt  +  putly!  3 
pu>t  +  puttz/3 
pDi2C\teni  +  kniTi( 

pMensf  ^ 

\  puh0  / 

pDi2ci(e„M  -|-  knu  'T\fi 
\  ©  / 

0=[ 

tilk  +  kT,  +  ^2kniTH+p 

j=i 


i=i 


and 


q2  =  u2  -f  v2  +  w2 
U(  =  +  £yvt  + 


The  other  flux  vectors  can  be  written  in  a  similar  manner.  In  the  above,  some  simplifying 
assumptions  have  been  made,  e.g.  the  use  of  a  binary  diffusion  coefficient  as  opposed  to 
a  multi-component  diffusion  model.  However,  the  particular  choice  of  a  chemistry  and 
thermodynamics  model  is  up  to  the  user.  Discussions  of  various  models  and  their  practical 
applications  can  be  found  in  [9-10]. 


Spatial  Discretization 

An approach that has become popular during the past few years for discretizing
hyperbolic conservation laws is the so-called upwind technique, in which the numerics
attempt to model the physics by differencing the characteristic information independently.
One advantage of such a formulation is the increased robustness of codes that incorporate
this technology, particularly in the high-speed regime. There are two general classes of
upwind methods: flux vector splitting (FVS) and flux difference splitting (FDS). Among
the FVS schemes, the Steger-Warming [11] and Van Leer [12] splittings are the best known.
Roe's scheme [13] is by far the most popular FDS technique. The splittings were originally
developed for the one-dimensional flow of a perfect gas and have been extended to
three-dimensional generalized coordinates and to thermo-chemical nonequilibrium flows by
several researchers, including the author [cf. 8].
Upwind techniques can be implemented on structured, unstructured, or hybrid
meshes (i.e., grids that contain features of both). Both cell-centered and cell-vertex
discretizations can also be developed, and in a variety of ways. In this paper, only one
cell-centered approach and one cell-vertex method will be discussed in two dimensions on
a triangular mesh, followed by a three-dimensional technique on a hybrid mesh. The latter
approach has found particular utility for high-speed space-marching simulations.

The cell-centered method shown in Fig. 1 stores the dependent variables at the centroid
of each triangle; the edges of the triangle define the faces of the control volume.


Figure  1.  Typical  control  volume  for  a  cell-centered  technique. 

The cell-vertex scheme depicted in Fig. 2 stores the conserved variables at the vertices
of the triangles. The control volume is formed by connecting the centroids of the
triangles surrounding each vertex to the midpoints of the edges. This makes each vertex
the approximate cell center of the control volume created around it.
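As an illustration of this construction, the area of each vertex-centered control volume can be accumulated triangle by triangle: joining the centroid of a triangle to its three edge midpoints splits it into three pieces of equal area, one per vertex. The sketch below assumes a simple mesh representation (a list of point coordinates and a list of vertex-index triples), which is not taken from the paper.

```python
def dual_cell_areas(points, triangles):
    """Area of the centroid/midpoint control volume around each vertex.

    points:    list of (x, y) coordinates
    triangles: list of (a, b, c) vertex-index triples
    """
    areas = [0.0] * len(points)
    for a, b, c in triangles:
        (xa, ya), (xb, yb), (xc, yc) = points[a], points[b], points[c]
        tri_area = 0.5 * abs((xb - xa) * (yc - ya) - (xc - xa) * (yb - ya))
        # Each vertex receives exactly one third of the triangle area.
        for v in (a, b, c):
            areas[v] += tri_area / 3.0
    return areas
```

The dual areas always sum to the total mesh area, which is a convenient sanity check when building the control volumes.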


Figure  2.  Typical  control  volume  for  a  cell-vertex  technique. 
It should be noted that these techniques can be applied to both triangles and
quadrilaterals in two dimensions, and to tetrahedra and hexahedra in three dimensions, and
they are independent of the manner in which the individual mesh elements and control volumes
are addressed (i.e. structured or unstructured). Moreover, these methods can be applied
to even more general elements and control volumes, such as the one shown in Fig. 3.


Figure  3.  Control  volume  on  a  structured/unstructured  mesh. 


For space-marching applications, one needs not only a direction for which the inviscid
field is entirely supersonic, but also a mesh that contains control surfaces which can be used
to march the solution in space. General three-dimensional unstructured grids do not
inherently contain such surfaces, whereas it is easy to construct structured meshes that meet
this requirement. Thus, space-marching algorithms on structured meshes have been developed
and are popular because the computing times required for space-marching calculations
are much less than those required by standard global iteration techniques. However, the
requirement of a logical indexing scheme results in a loss of geometric modeling flexibility,
which is a severe drawback of structured grid techniques.

The control volume shown in Fig. 3 has been obtained by constructing a hybrid
structured/unstructured three-dimensional grid. It consists of two-dimensional unstructured
triangular cell faces in each cross-flow plane, which have been stacked together in
the streamwise (supersonic) direction, thus giving rise to mesh elements that are five-sided
prisms. This grid can be used for developing space-marching methods, since each plane
can be a surface on which the Mach number based on the local contravariant component
of velocity normal to all of the cell faces is supersonic. The advantage of such a
formulation is that it combines the increased numerical efficiency of space-marching algorithms
with the geometric flexibility of an unstructured indexing scheme. One can also construct
other types of hybrid meshes for marching applications that use different elements to form
the base grid (e.g. tetrahedra), as long as surfaces can be constructed for advancing the
solution in the streamwise direction.


With  any  of  these  discretizations,  one  may  replace  the  integral  form  of  the  governing 
equations  by  the  semi-discrete  approximation 


$$
V_i \frac{d\langle Q_i \rangle}{dt} = V_i W_i - \sum_{j=\kappa(i)} F_{ij} A_{ij} \equiv R_i \qquad (3)
$$


where $\langle Q_i \rangle$ is the volume average of $Q$ in the $i$th control volume, $V_i$ is the control volume,
$F_{ij}$ is the flux out of element $i$ through face $j$, $A_{ij}$ is the area of the $j$th cell face of
volume $i$, and $\kappa(i)$ is a list of neighboring cells.
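
The right-hand side of (3) can be accumulated face by face. The following is a minimal sketch only, assuming scalar unknowns and a hypothetical face list of the form `(i, j, flux, area)`; the function name and data layout are illustrative and not taken from the original code.

```python
def assemble_residual(volumes, sources, faces):
    """Return R_i = V_i*W_i - sum_j F_ij*A_ij for every control volume.

    faces: list of (i, j, flux_out_of_i, area); j is None on a boundary face.
    """
    residual = [v * w for v, w in zip(volumes, sources)]
    for i, j, flux, area in faces:
        residual[i] -= flux * area        # flux leaves control volume i
        if j is not None:
            residual[j] += flux * area    # the same flux enters neighbor j
    return residual


# Three cells connected in a ring, no source term, interior faces only.
volumes = [1.0, 2.0, 1.5]
sources = [0.0, 0.0, 0.0]
faces = [(0, 1, 0.3, 2.0), (1, 2, -0.1, 1.0), (2, 0, 0.25, 4.0)]
residual = assemble_residual(volumes, sources, faces)
```

Because every interior flux is subtracted from one cell and added to its neighbor, the residuals sum to zero when there are no sources or boundary faces, which is the discrete statement of conservation.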

In  order  to  apply  an  upwind  scheme,  it  is  necessary  to  obtain  two  distinct  fluid 
dynamic  states  on  each  side  of  a  cell  face,  frequently  referred  to  as  the  left  and  right  states. 
For  first  order  accuracy,  the  left  and  right  states  may  be  obtained  by  assuming  a  piecewise 
constant distribution of the state variables within the control volumes. Thus, one simply
obtains the left and right states from either the cell-centered or cell-vertex volume-averaged
values immediately adjacent to the cell face at which the numerical flux is sought,
depending on the scheme employed.

In  order  to  increase  the  spatial  accuracy,  a  piecewise  linear  distribution  of  the  data 
may  be  assumed.  From  this  reconstruction,  more  accurate  left  and  right  states  may  be 
determined  and  their  values  limited  in  such  a  way  that  no  new  extrema  are  generated 
in  an  effort  to  prevent  spurious  oscillations  in  the  vicinity  of  discontinuities.  This  linear 
distribution  of  the  cell  averaged  flow  variables  can  be  represented  by 

$$
Q(x, y) = Q(x_0, y_0) + \nabla Q \cdot \mathbf{r} \qquad (4)
$$

where $\mathbf{r}$ is the vector from the cell center $(x_0, y_0)$ to any point $(x, y)$ in the cell, and $\nabla Q$
represents the solution gradient in the cell. Note that this equation is simply the first-order
accurate Taylor approximation plus a higher-order correction. For each cell, since
the solution gradient $\nabla Q$ is constant, it can be computed from

$$
\nabla Q_A = \frac{1}{S_\Omega} \oint_{\partial\Omega} Q\,\mathbf{n}\,dl
$$

where $S_\Omega$ is the area contained in the path of integration. For the cell-centered case, the
path chosen passes through the centroids of the cells $B$, $C$, and $D$ which surround cell $A$,
as indicated in figure 4.

For the vertex scheme, the path connects the neighboring vertices $B$, $C$, $D$, $E$, $F$, and $G$
as shown in figure 5. Both paths ensure exact calculation of $\nabla Q_A$ when $Q$ varies linearly.

Figure 4. Integration path for cell-centered gradient calculation


Figure  5.  Integration  path  for  cell-vertex  gradient  calculation 
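
The path integral for the gradient can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the path is taken as a closed polygon through the neighboring points, and $Q$ is interpolated linearly along each edge (trapezoidal rule). The function name and the sample points are hypothetical.

```python
def green_gauss_gradient(points, values):
    """Gradient from (1/S) * closed-path integral of Q n dl over a polygon.

    points: list of (x, y) along the closed path; values: Q at those points.
    """
    n = len(points)
    area = 0.0
    gx = 0.0
    gy = 0.0
    for k in range(n):
        x1, y1 = points[k]
        x2, y2 = points[(k + 1) % n]
        q_mid = 0.5 * (values[k] + values[(k + 1) % n])   # trapezoidal rule
        area += 0.5 * (x1 * y2 - x2 * y1)                 # signed polygon area
        gx += q_mid * (y2 - y1)                           # n_x dl = dy
        gy -= q_mid * (x2 - x1)                           # n_y dl = -dx
    return gx / area, gy / area


# Six neighboring vertices around a central vertex, with a linear field Q.
path = [(1.0, 0.0), (0.6, 0.9), (-0.4, 1.1), (-1.0, 0.1), (-0.5, -0.8), (0.5, -1.0)]
q_values = [2.0 + 3.0 * x - 1.0 * y for x, y in path]
grad_x, grad_y = green_gauss_gradient(path, q_values)
```

For a linear field the trapezoidal rule is exact on every edge, so the sketch recovers the gradient (3, -1) to round-off, which mirrors the exactness property stated above.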


One can consider a limited version of the linear function about the centroid of cell $A$

$$
Q(x, y)_A = Q(x_0, y_0)_A + \Phi_A \nabla Q_A \cdot \mathbf{r}_A, \qquad \Phi_A \in [0, 1]
$$

In order to find the value of $\Phi_A$, a monotonicity principle is enforced on the unlimited
quantities $Q_{A_j} = Q(x_j, y_j)_A$ calculated in (4) at the faces of cell $A$. It requires that the
values computed at the faces must not exceed the maximum and minimum of neighboring
cell values including the value in cell $A$, i.e.,

$$
Q_A^{\min} \le Q_{A_j} \le Q_A^{\max}
$$

where $Q_A^{\min} = \min(Q_A, Q_{\text{neighbors}})$ and $Q_A^{\max} = \max(Q_A, Q_{\text{neighbors}})$. $\Phi$ can now be
calculated for each face $j$ of cell $A$ as

$$
\Phi_{A_j} =
\begin{cases}
\min\left(1, \dfrac{Q_A^{\max} - Q_A}{Q_{A_j} - Q_A}\right), & \text{if } Q_{A_j} - Q_A > 0 \\
\min\left(1, \dfrac{Q_A^{\min} - Q_A}{Q_{A_j} - Q_A}\right), & \text{if } Q_{A_j} - Q_A < 0 \\
1, & \text{if } Q_{A_j} - Q_A = 0
\end{cases}
$$

with $\Phi_A = \min(\Phi_{A_j})$, where $j = \kappa(1), \kappa(2), \kappa(3)$. For a more in-depth discussion
of this higher-order accurate scheme, see Barth [3].
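
The limiting step can be sketched for a single scalar variable in one cell. The function name and the sample numbers below are hypothetical; the logic follows the three cases of the limiter described above.

```python
def barth_limiter(q_cell, q_neighbors, q_faces):
    """Return Phi_A in [0, 1] for one cell.

    q_cell: cell value Q_A; q_neighbors: neighboring cell values;
    q_faces: unlimited reconstructed values at the faces of the cell.
    """
    q_min = min([q_cell] + list(q_neighbors))
    q_max = max([q_cell] + list(q_neighbors))
    phi = 1.0
    for q_face in q_faces:
        delta = q_face - q_cell
        if delta > 0.0:
            phi_face = min(1.0, (q_max - q_cell) / delta)
        elif delta < 0.0:
            phi_face = min(1.0, (q_min - q_cell) / delta)
        else:
            phi_face = 1.0
        phi = min(phi, phi_face)      # most restrictive face wins
    return phi


q_a = 1.0
neighbors = [0.5, 1.2, 0.9]
unlimited_faces = [1.4, 0.8, 0.95]
phi_a = barth_limiter(q_a, neighbors, unlimited_faces)
limited_faces = [q_a + phi_a * (q - q_a) for q in unlimited_faces]
```

In this example only the first face overshoots the neighboring maximum of 1.2, so the limiter scales the whole reconstruction by about one half and the limited face values stay within the range of the neighboring data.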


Temporal  Discretization 

In  order  to  obtain  a  steady-state  solution,  the  governing  equations  must  be  integrated 
in time. Seven time integration methods have been considered: a four-stage explicit
Runge-Kutta scheme, a four-stage Runge-Kutta scheme with implicit residual smoothing, point Jacobi,
point Gauss-Seidel, a block Gauss-Seidel type relaxation, a fully implicit LU decomposition,
and a hybrid scheme which combines Runge-Kutta and LU decomposition. The Jameson-style
Runge-Kutta scheme used here can be written as:

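$$
\begin{aligned}
Q^{(0)} &= Q^{(N)} \\
Q^{(1)} &= Q^{(0)} + \alpha_1 \frac{\Delta t}{V} R(Q^{(0)}) \\
Q^{(2)} &= Q^{(0)} + \alpha_2 \frac{\Delta t}{V} R(Q^{(1)}) \\
Q^{(3)} &= Q^{(0)} + \alpha_3 \frac{\Delta t}{V} R(Q^{(2)}) \\
Q^{(4)} &= Q^{(0)} + \alpha_4 \frac{\Delta t}{V} R(Q^{(3)}) \\
Q^{(N+1)} &= Q^{(4)}
\end{aligned}
$$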

where $R(Q)$ is the right-hand side of (3) and the $\alpha_k$ are weights. Convergence to the steady
state can be accelerated by using a local time-stepping technique in which the maximum
permissible  time  step  for  each  individual  cell  in  the  flow  field  is  used,  as  dictated  by  local 
stability  analysis.  In  addition,  the  Runge-Kutta  scheme  can  be  accelerated  by  applying 
implicit  residual  smoothing  at  every  stage  of  the  time  integration.  Residual  smoothing  is 
essentially  a  Laplacian  filtering  of  the  numerical  values  of  the  residuals.  After  every  stage 
a  new  value  of  the  residual  is  obtained  from 

$$
\bar{R}_i = R_i + \epsilon \nabla^2 \bar{R}_i
$$

where $\bar{R}_i$ is the Laplacian filtered value of $R_i$. The undivided Laplacian, $\nabla^2 \bar{R}_i$, can be
represented on an unstructured mesh as:

$$
\nabla^2 \bar{R}_i = \sum_{j=\kappa(i)} (\bar{R}_j - \bar{R}_i)
$$

The resulting implicit equation for $\bar{R}_i$ is solved here by Jacobi iteration. Typically, two
Jacobi iterations were performed with $\epsilon = 0.5$.
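
The multistage update with residual smoothing can be sketched as below. This is only an illustration for scalar unknowns: the stage weights 1/4, 1/3, 1/2, 1 are an assumed, commonly used choice (the text does not list the values), and the function names, the neighbor lists, and the model problem are hypothetical.

```python
def smooth_residual(residual, neighbors, eps=0.5, sweeps=2):
    """Jacobi sweeps on Rbar_i = R_i + eps * sum_j (Rbar_j - Rbar_i)."""
    smoothed = list(residual)
    for _ in range(sweeps):
        # The comprehension reads only the previous sweep, i.e. Jacobi.
        smoothed = [
            (residual[i] + eps * sum(smoothed[j] for j in neighbors[i]))
            / (1.0 + eps * len(neighbors[i]))
            for i in range(len(residual))
        ]
    return smoothed


def runge_kutta_step(q, residual_fn, dt, volumes,
                     alphas=(0.25, 1.0 / 3.0, 0.5, 1.0), neighbors=None):
    """One multistage step; dt holds a local time step for every cell."""
    q0 = list(q)
    for alpha in alphas:
        r = residual_fn(q)
        if neighbors is not None:
            r = smooth_residual(r, neighbors)   # smoothing at every stage
        q = [q0[i] + alpha * dt[i] / volumes[i] * r[i] for i in range(len(q0))]
    return q


# Four cells in a ring with a residual spike in cell 0.
ring = [[3, 1], [0, 2], [1, 3], [2, 0]]
smoothed_spike = smooth_residual([1.0, 0.0, 0.0, 0.0], ring)

# Model problem dQ/dt = -Q on a single cell, without smoothing.
q_new = runge_kutta_step([1.0], lambda q: [-x for x in q], [0.1], [1.0])
```

Two Jacobi sweeps spread the spike to `[0.625, 0.125, 0.125, 0.125]`, and a constant residual field is left unchanged by the filter.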

The  point  Jacobi,  point  Gauss-Seidel,  block  Gauss-Seidel,  and  LU  decomposition 
schemes  that  have  been  studied  utilize  the  Euler  implicit  time  integration  algorithm  as  a 
common  starting  point  which  can  be  represented  in  delta  form  as 

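$$
\frac{V \Delta Q}{\Delta t} = R^{N+1}
$$

where $\Delta Q = Q^{N+1} - Q^N$. The equation can be linearized and written as

$$
A \Delta Q = R^N
$$

where

$$
A = \left( \frac{V}{\Delta t} - \frac{\partial R^N}{\partial Q} \right).
$$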

The  matrix  is  generally  large  and  sparse  and  has  a  variable  bandwidth.  The  linear 
system  can  be  an  approximate  or  exact  linearization  of  the  residual,  R.  Many  relaxation 
schemes  for  solving  the  linear  problem  rewrite  A  as 


A  =  M  +  D  +  N 


where $M$ is a lower triangular matrix, $D$ is a diagonal matrix, and $N$ is an upper triangular
matrix. With a point Jacobi method, the block matrices on the diagonal are inverted and
multiplied by the right-hand side to obtain

$$
\Delta Q = D^{-1} R^N
$$
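
A point Jacobi update can be sketched as one small solve per cell. The example below assumes, purely for illustration, two unknowns per cell so that each diagonal block is 2x2; the function names and numbers are hypothetical.

```python
def solve_2x2(block, rhs):
    """Solve a 2x2 system by Cramer's rule."""
    (a, b), (c, d) = block
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def point_jacobi_update(diag_blocks, residuals):
    """Return dQ_i = D_i^{-1} R_i for every cell; cells are independent."""
    return [solve_2x2(block, rhs) for block, rhs in zip(diag_blocks, residuals)]


diag_blocks = [[[2.0, 1.0], [1.0, 3.0]], [[4.0, 0.0], [0.0, 5.0]]]
residuals = [[3.0, 5.0], [2.0, 10.0]]
delta_q = point_jacobi_update(diag_blocks, residuals)
```

Since no cell depends on the update of another, the loop over cells is free of recurrences, in contrast to the Gauss-Seidel sweeps.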


To implement a point Gauss-Seidel method, the off-diagonal terms of $A$ are multiplied by
the current approximation to $\Delta Q$ and are subtracted from the residual. Point Gauss-Seidel
can be written for $i = 1, \ldots, n$ as

$$
\Delta Q_i^{(1)} = D_i^{-1} \left[ R_i^N - \sum_{j=1}^{i-1} M_{ij} \Delta Q_j^{(1)} - \sum_{j=i+1}^{n} N_{ij} \Delta Q_j^{(0)} \right]
$$

The superscripts on $\Delta Q$ refer to the inner iteration number of the Gauss-Seidel method
on the linear system. Typically, $\Delta Q^{(0)}$ is an initial guess for the Gauss-Seidel solver and
$\Delta Q^{(1)}$ is used to update $Q^N$ to $Q^{N+1}$. Due to the recursive nature of the point Gauss-Seidel
algorithm, complete vectorization of this method is not possible. Point Gauss-Seidel
can be made symmetrical by sweeping through the list of vertices in the opposite direction
before updating $Q$. The symmetrical point Gauss-Seidel can be written for $i = n, \ldots, 1$ as

$$
\Delta Q_i^{(2)} = D_i^{-1} \left[ R_i^N - \sum_{j=1}^{i-1} M_{ij} \Delta Q_j^{(1)} - \sum_{j=i+1}^{n} N_{ij} \Delta Q_j^{(2)} \right]
$$

with $\Delta Q^{(2)}$ being used to update $Q$.

In  order  to  perform  the  block  Gauss-Seidel  type  relaxation  used  here,  it  is  first  useful 
to  renumber  the  cells  to  get  as  many  of  the  elements  of  A  as  possible  into  tridiagonal  form. 
A  typical  matrix  (associated  with  the  transonic  channel  flow  problem  discussed  later)  is 
shown  in  figure  6  for  a  1005  element  mesh.  The  matrix  can  be  subdivided  into  several 
subsections  (denoted  by  the  blocks)  of  variable  length.  A  close  up  of  the  third  section 
shows  the  essentially  tridiagonal  form.  Each  subsection  is  then  solved  by 

$$
T \Delta Q = R(Q^N, Q^{N+1})
$$

where $T$ denotes a tridiagonal submatrix, and the residual becomes a function of $Q^N$ and
$Q^{N+1}$. Since the blocks are independent of each other, the inversion of these submatrices
can be vectorized over the number of blocks. When the matrix $A$ is divided into subsections
of variable length, a minimum allowable length is imposed. The inversion procedure is then
vectorized for the elements in every block up to the minimum number of elements allowed;
the remainder of the elements are then computed in scalar mode.
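
The loop structure behind this vectorization can be sketched with the Thomas algorithm. Plain Python does not vectorize, so the sketch only shows the ordering: rows up to the minimum block length are processed across all blocks together (the inner loop over blocks is the vectorizable one), and the remaining rows are handled block by block. The data layout and function name are hypothetical.

```python
def solve_tridiagonal_blocks(blocks):
    """Solve independent tridiagonal systems with the Thomas algorithm.

    blocks: list of (lower, diag, upper, rhs), each a list of length n_b;
    lower[0] and upper[-1] are ignored. Returns one solution per block.
    """
    m = min(len(diag) for _, diag, _, _ in blocks)     # minimum block length
    work = [(list(diag), list(rhs)) for _, diag, _, rhs in blocks]

    def eliminate(block, data, k):
        lower, _, upper, _ = block
        d, r = data
        w = lower[k] / d[k - 1]
        d[k] -= w * upper[k - 1]
        r[k] -= w * r[k - 1]

    # Forward elimination: common rows across all blocks, then the remainder.
    for k in range(1, m):
        for block, data in zip(blocks, work):
            eliminate(block, data, k)
    for block, data in zip(blocks, work):
        for k in range(m, len(data[0])):
            eliminate(block, data, k)

    # Back substitution: remainder rows per block first, then common rows.
    solution = [[0.0] * len(d) for d, _ in work]
    for block, (d, r), x in zip(blocks, work, solution):
        n = len(d)
        x[n - 1] = r[n - 1] / d[n - 1]
        for k in range(n - 2, m - 2, -1):
            x[k] = (r[k] - block[2][k] * x[k + 1]) / d[k]
    for k in range(m - 2, -1, -1):
        for block, (d, r), x in zip(blocks, work, solution):
            x[k] = (r[k] - block[2][k] * x[k + 1]) / d[k]
    return solution


blocks = [
    ([0.0, -1.0, -1.0], [4.0, 4.0, 4.0], [-1.0, -1.0, 0.0], [2.0, 4.0, 10.0]),
    ([0.0, 1.0], [2.0, 3.0], [1.0, 0.0], [3.0, 5.0]),
]
solutions = solve_tridiagonal_blocks(blocks)
```

Here the minimum block length is two, so the third row of the first block is the only one handled in the scalar remainder loops.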


Figure  6.  Structure  of  the  linear  system  and  close-up  of  the  3rd  block 

Once the tridiagonal matrices have been inverted, the values of $\Delta Q$ are solved for
sequentially over each section of the matrix $A$. The elements to the left of the diagonal on
a forward sweep (or the elements to the right of the diagonal on a reverse sweep) through
the matrix are included implicitly in the solution procedure. After the values for $\Delta Q$ of
a subsection are calculated, the residual for the next subsection is computed using these
new updated values, resulting in a non-linear update of the residual.

The LU decomposition method solves the linear system exactly. A renumbering
scheme (currently reverse Cuthill-McKee) is used as a preprocessor for the LU decomposition.
The solution procedure involves factoring $A$ into the product of a lower triangular
and an upper triangular matrix ($L$ and $U$) such that

$$
[LU] \Delta Q = R^N
$$

and then the system $L \Delta Q^* = R^N$ is solved by forward substitution and $U \Delta Q = \Delta Q^*$
by backward substitution. This procedure reduces to Newton's method as $\Delta t \to \infty$, and
as a result it exhibits quadratic convergence to the solution of the non-linear system of
equations under certain restrictions. A simple change to this scheme which has proven
to be more efficient involves freezing the LU decomposition in time and performing only
forward and backward substitutions to advance the solution vector $Q$ to the steady state.
This approach can be represented by

$$
\Delta Q^{N+K} = (U^{-1} L^{-1})^N R(Q^{N+K}, Q^{N+K+1})
$$

where $N$ represents the time at which the LU is frozen, and $N + K$ is the current time
step.
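
The frozen-LU idea can be sketched on a small dense system. Everything below is illustrative: the factorization omits pivoting (it assumes a diagonally dominant matrix), the two-equation model residual is hypothetical, and the limit of a very large time step is assumed so that $A$ reduces to $-\partial R/\partial Q$.

```python
def lu_factor(a):
    """In-place style Doolittle LU without pivoting; returns L and U packed."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                    # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Forward substitution with unit-diagonal L, then backward with U."""
    n = len(b)
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def residual(q):
    """Hypothetical weakly non-linear model residual R(Q)."""
    return [1.0 - 4.0 * q[0] + q[1] - 0.1 * q[0] ** 3,
            2.0 + q[0] - 4.0 * q[1] - 0.1 * q[1] ** 3]


# Factor A = -dR/dQ once at Q = 0, then reuse the factors at every step.
frozen_lu = lu_factor([[4.0, -1.0], [-1.0, 4.0]])
q = [0.0, 0.0]
for _ in range(20):
    dq = lu_solve(frozen_lu, residual(q))
    q = [qi + dqi for qi, dqi in zip(q, dq)]
```

With the factors frozen, each step costs only one residual evaluation and two triangular solves; convergence becomes linear rather than quadratic, which is the trade-off described above.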

It has been found that, for many problems, a practical scheme is to use Runge-Kutta
initially and then LU decomposition with frozen decomposition elements as the
steady state is approached. During the initial transients, LU decomposition is
not efficient because the original estimate of the solution is typically far from the steady-state
solution and the work associated with the inversions at the onset cannot be justified.

Mesh  Generation,  Adaptation,  and  Graphical  Interface 

The procedure used for generating triangular elements about an arbitrary configuration
is an advancing front method discussed by Löhner [4]. This technique requires input of
stretching parameters $\alpha$, $\delta$, and $s$. The direction $\alpha$ denotes the direction of stretching, with
$\delta$ being the element size at right angles to $\alpha$ and $s\delta$ being the element size in the direction
of stretching. These parameters are given in the context of a coarse background grid of
triangular elements which cover the solution domain. They are first used to place nodes
along the boundaries of the computational region. The sides connecting these nodes form
the initial generation front. Elements are added by interpolating the stretching parameters
from the background grid, the entire front is updated, and the process is repeated. Once the
entire domain has been triangulated, a smoothing routine is performed to improve the
quality of the generated mesh.

The grid generation process described requires input of a background mesh and
stretching parameters. In general, a very simple background mesh with no stretching is
input for computing the initial grid. The initial grid will then become the background grid
for the first mesh regeneration. This process requires utilizing a measure of the error in the
computed solution on the initial grid to obtain the parameters $\alpha$, $\delta$, and $s$. An improved
mesh can then be generated with the initial grid as a background grid along with these
computed parameters.

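If the density $\rho$ is chosen as the variable used to determine the error, then an estimate
for the root mean square value of the local error can be given by

$$
E_i^{rms} = C h_i^2 \left| \frac{d^2 \hat{\rho}}{dx^2} \right|_i
$$

where $h_i$ is the local element size, $C$ is a constant, and $\hat{\rho}$ denotes the computed solution.
A criterion for a uniform value of the error over the entire domain would then be

$$
h_i^2 \left| \frac{d^2 \hat{\rho}}{dx^2} \right|_i = \text{constant}
$$

which suggests that the value $\delta$ on the new mesh should be computed so that

$$
\delta^2 \left| \frac{d^2 \hat{\rho}}{dx^2} \right| = \text{constant}
$$

For multi-dimensional problems, this approach is employed in each direction separately.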
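
A one-dimensional sketch of how a new spacing could be derived from the computed density is given below. It assumes an error indicator proportional to the square of the element size times the magnitude of the second derivative of the density, estimated here by finite differences on the old nodes; the target constant and the spacing bounds are hypothetical user inputs.

```python
import math


def new_spacing(x, rho, target, delta_min, delta_max):
    """Spacing at interior nodes so that delta**2 * |rho''| equals target."""
    spacing = []
    for i in range(1, len(x) - 1):
        h_left = x[i] - x[i - 1]
        h_right = x[i + 1] - x[i]
        # Second derivative on a possibly non-uniform grid.
        d2 = 2.0 * (h_left * rho[i + 1] - (h_left + h_right) * rho[i]
                    + h_right * rho[i - 1]) / (h_left * h_right * (h_left + h_right))
        if d2 == 0.0:
            delta = delta_max                 # smooth region: coarsest spacing
        else:
            delta = math.sqrt(target / abs(d2))
        spacing.append(min(delta_max, max(delta_min, delta)))
    return spacing


x_old = [0.0, 1.0, 2.0, 3.0]
rho_old = [xi * xi for xi in x_old]           # second derivative equals 2
delta_new = new_spacing(x_old, rho_old, target=0.02, delta_min=0.01, delta_max=1.0)
```

The clipping to a minimum and maximum spacing is a practical safeguard added for the sketch: near a shock the second derivative grows without bound, and in uniform flow it vanishes.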

This mesh regeneration procedure is generally performed several times in order to
obtain a well-refined mesh in as little CPU time as possible. At each remeshing stage the
solution  on  the  old  grid  is  interpolated  onto  the  newly  generated  grid.  This  is  accomplished 
by  simply  using  the  values  of  the  nearest  point  on  the  old  mesh  as  the  values  at  each 
point  on  the  new  mesh.  This  is  not  conservative,  nor  is  it  extremely  accurate,  but  it  is 
very  fast,  and  the  error  introduced  is  dominated  by  high  frequencies  and  hence  will  be 
damped  out  quickly.  It  should  be  emphasized  that  though  the  intermediate  interpolation 
is  not  conservative,  the  solution  on  the  final  mesh  will  be  conservative.  This  process 
is  repeated  a  fixed  number  of  times  before  turning  the  remeshing  off  and  converging  the 
solution  to  the  steady-state  on  the  final  mesh. 
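
The nearest-point transfer between meshes can be sketched with a brute-force search. The function name and the sample points are hypothetical, and a production code would likely use a faster spatial search than this quadratic loop.

```python
def nearest_point_transfer(old_points, old_values, new_points):
    """Copy to each new point the value stored at the closest old point."""
    new_values = []
    for x, y in new_points:
        nearest = min(
            range(len(old_points)),
            key=lambda k: (old_points[k][0] - x) ** 2 + (old_points[k][1] - y) ** 2,
        )
        new_values.append(old_values[nearest])
    return new_values


old_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
old_values = [1.0, 2.0, 3.0]
new_points = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.7)]
new_values = nearest_point_transfer(old_points, old_values, new_points)
```

No weighting or integration is involved, which is why the transfer is fast but neither conservative nor very accurate.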

It  is  very  difficult  to  predict  when  and  how  many  remeshing  stages  will  be  required. 
Therefore,  a  graphics  interface  has  been  developed  to  allow  the  user  to  have  a  certain 
degree  of  control  in  guiding  the  flow  solver.  For  the  purpose  of  portability,  all  of  the  device 
dependent  graphics  subroutines  have  been  separated  from  the  main  flow  solver.  Currently, 
three separate graphics libraries have been developed: a Sun GKS, an Iris4D, and a DI-3000
library. The DI-3000 graphics package is device independent and will work on any
computer  with  the  DI-3000  software  installed  such  as  the  NASA  Cray-2  supercomputers 
Voyager  and  Navier. 

When the flow solver is run, the user can execute it in either
automatic or interactive mode. If the automatic mode is chosen, the solver will execute
without user interaction. If interactive mode is chosen, the solver will pause for input just
before each remeshing step is to take place. The user can then view the
current solution and mesh, and decide whether to remesh or continue computing on
the current mesh. In this way, the user is able to guide the solver, avoiding unnecessary
or premature remeshing.


Computational  Results 


Transonic  Channel  Flow 

Convergence rates for inviscid transonic ($M_\infty = 0.85$) flow over a circular arc
in a channel on a 1005 element mesh are compared for first-order accurate cell-centered and
cell-vertex schemes. Figure 7 compares the norm of the residual versus Cray Y-MP
CPU time for various time integration strategies using both a cell-centered and a cell-vertex
finite volume discretization. For the cell-centered calculations, the plot compares the
Runge-Kutta,  the  LU  decomposition,  the  tridiagonal  block  Gauss-Seidel  type  relaxation, 
and  the  hybrid  Runge-Kutta/LU  methods.  Results  are  also  given  in  this  plot  for  both  the 
LU  decomposition  and  the  tridiagonal  algorithms  utilizing  a  frozen  Jacobian  matrix.  The 
results  clearly  demonstrate  the  utility  of  using  the  combined  Runge-Kutta/LU  strategy 
over  the  other  schemes. 

For  the  cell-vertex  calculations,  the  plot  compares  the  Runge-Kutta,  the  Runge- 
Kutta  with  residual  smoothing,  the  point  Jacobi,  and  the  point  Gauss-Seidel  schemes. 
The  point  Jacobi  and  point  Gauss-Seidel  methods  also  have  the  capability  to  reuse  the 
Jacobian  matrices  and  to  alternate  the  sweep  direction.  The  fastest  method  in  this  case 
is the symmetric point Gauss-Seidel method with Jacobian reuse. Its performance is
comparable to that of the block Gauss-Seidel scheme from the previous plot. A more thorough
discussion of these and other convergence acceleration methods can be found in [7].

Supersonic  Inlet 

The  purpose  of  this  test  case  is  to  demonstrate  the  utility  of  remeshing  for  a  problem 
with  complex  wave  interactions.  Figure  8  shows  the  first  order  adaptive  remeshing  sequence 
beginning  on  the  initial  85  element  uniform  mesh  (figure  8(a))  and  finishing  on  the  4211 
element mesh (figure 8(i)). The first few grids in the sequence develop the initial shock,
expansion and reflection, and the later remeshes better define these phenomena. Figure
9 shows the higher order remeshing sequence which starts with the final first order solution
and  ends  on  a  mesh  with  6247  elements.  The  higher  order  remeshes  make  a  significant 
contribution  in  refining  all  of  the  shocks,  especially  those  toward  the  exit. 

The  remeshing  strategy  took  approximately  the  same  CPU  time  (700  sec)  as  the 
solution  computed  on  the  6263  element  uniform  mesh  shown  in  figure  10.  Since  the  adapted 
mesh  has  more  elements  near  the  shocks,  and  since  these  elements  are  stretched  in  the 
direction  of  the  shocks,  the  pressure  contours  for  this  case  display  much  better  resolution 
than  the  solution  on  the  uniform  mesh. 

Figure  7.  Residual  versus  CPU  time  for  a  transonic  channel  flow. 

Model  SR71  Aircraft 

As an example of the use of a space-marching method on a structured/unstructured
mesh, and to demonstrate the geometric flexibility of this algorithm, the solution about
a simplified model of the SR71 reconnaissance aircraft has been calculated. This model
includes  a  region  of  multiple  elements  in  the  streamwise  direction  at  the  start  of  the 
engine  inlets,  as  well  as  multiple  vertical  tails.  Both  of  these  geometries  cause  difficulties 
with  a  structured  discretization,  while  they  impose  no  restrictions  with  an  unstructured 
discretization.  The  solution  was  calculated  in  a  Mach  3.5  free  stream  at  a  0°  angle  of 
incidence  relative  to  the  root  chord.  This  solution  was  obtained  on  a  grid  with  a  total 
of 42 cross-flow planes. Figure 11 shows the structured nature of the grid on the surface
of the body and contours of pressure in the exit plane. Also shown are shaded pressure
contours along the body surface and exit plane along with the unstructured triangular grid
in this plane. It is noteworthy that this calculation has been performed on a wide range
of computers including an IRIS-4D graphics workstation, since both the memory and CPU
time required by this approach are minimal. Other examples and comparisons with both
theory and experiment for this space-marching method are presented in [5].

Concluding  Remarks 

The development of algorithms on unstructured and hybrid meshes for compressible
flow simulations has become a popular research subject, in addition to the development
of the actual grid generators. This attention is due to the significant improvements that
are possible with these approaches. Rapid and robust three-dimensional grid generation
appears to be a primary topic requiring further attention, although a few research groups
have demonstrated impressive results. In general, however, the CFD community does not
appear to have this technology well in hand, although significant resources are now being
put into this effort. The flow solvers for unstructured meshes are progressing rapidly and
do not appear to be hindering the transfer of technology to the user community.

Figure 8. Grids and pressure contours from a 1st order accurate remeshing sequence.
Panels: (a) mesh, 85 elements, 67 nodes; (b) pressure contours; (c) mesh, 453 elements,
268 nodes; (d) pressure contours; (g) mesh, 3754 elements, 1980 nodes; (h) pressure
contours; (i) mesh, 4211 elements, 2219 nodes; (j) pressure contours.

Figure 9. Grids and pressure contours from a 2nd order accurate remeshing sequence.
Panels: (a) mesh, 4211 elements, 2219 nodes; (b) pressure contours; (c) mesh, 5714 elements,
3005 nodes; (d) pressure contours; (e) mesh, 6247 elements, 3275 nodes; (f) pressure
contours.

FOR INVISCID AND VISCOUS HYPERSONIC FLOWS

SUMMARY  We plan to show that the use of space-centered approximations
can provide an accurate and efficient way to compute
compressible flows with shocks, even at large Mach and Reynolds
numbers.

First, we shall present the basic centered method we use for
solving the Euler equations. It is a two-time-level implicit finite-volume
method which is conservative, second-order accurate and always
linearly stable in any number of space dimensions. When applied
to transonic aerodynamics, it gives non-oscillatory solutions with
sharp shock profiles (over one or two mesh cells); see [1].

Then, we shall describe two modifications of this Euler
solver that we have recently investigated for the calculation of
hypersonic flows. The first one is the addition of a local entropy
correction [2] which preserves second-order accuracy and
unconditional stability. This correction enforces a discrete entropy
inequality at steady state (proved for a multidimensional hyperbolic
system of conservation laws with a convex entropy, in a general
structured mesh). The second modification of the basic Euler solver
concerns the introduction of a third time level to improve the
robustness of the method without altering the approximation at
steady state.

Finally,  we  shall  consider  the  extension  to  the  Navier-Stokes 
equations  with  a  particular  attention  to  stability  and  convergence 
rate  questions. 

Numerical applications to hypersonic problems will be presented
for inviscid and high-Reynolds-number laminar flows.

[1] A. LERAT and J. SIDES - "Efficient solution of the steady Euler
equations with a centered implicit method", in Numerical Methods
for Fluid Dynamics III, K.W. Morton and M.J. Baines Eds, Clarendon
Press, Oxford (1988), p. 65-86.

[2] K. KHALFALLAH and A. LERAT - "Correction d'entropie pour des schémas
numériques approchant un système hyperbolique", C.R. Acad. Sc.
Paris, 308 II (1989), p. 815-820.

Abstract

We present unsteady numerical solutions of the 2D or 3D Navier-Stokes equations, in order to study
the transition to turbulence in various free-shear layers. The following cases are considered:

a) A two-dimensional mixing layer with periodic boundary conditions in the flow direction (temporal
mixing layer). The numerical code uses finite-difference methods.

b)  Two-dimensional  spatially-developing  mixing  layers,  wakes  and  jets. 

c)  A  two-dimensional  compressible  mixing  layer. 

d)  A  three-dimensional  incompressible  temporal  mixing  layer  and  wake  (direct  and  large-eddy 
simulations,  pseudo-spectral  methods). 

In  all  the  cases,  turbulence  develops  from  a  random  perturbation  of  small  amplitude  and 
broad-band  spectrum  superposed  upon  a  basic  flow. 

In the two-dimensional case, it is shown that the coherent structures develop from the Kelvin-Helmholtz
instability. They undergo successive pairings, are shown to be unpredictable, and possess,
in the case of the mixing layer, a broad-band spatial spectrum of slope between $k^{-3}$ and
$k^{-4}$.

The compressible calculations show an inhibition of the instability above a convective Mach
number of 0.6, in good agreement with earlier experiments and calculations.

In the incompressible three-dimensional case, direct numerical simulations at low Reynolds
numbers show how hairpin vortices are strained longitudinally between the big rollers in
both the mixing layer and the wake. High Reynolds numbers can be reached with the aid of
a spectral subgrid-scale eddy viscosity. It is shown in this case that the above coherent structures
survive, and that the kinetic energy cascades towards the subgrid scales along a Kolmogorov $k^{-5/3}$
spectrum.

1  Introduction 


In the study of turbulent free-shear flows, there has been growing interest during the last
15 years in coherent structures, that is, structures having a recognisable shape for times
much longer than their turnover time¹. These coherent structures exist in particular in mixing
layers (Brown and Roshko, 1974), where they appear as spiralling vortices², and in jets and wakes
(Perry et al., 1982). They are extremely common in aeronautics, for instance after the detachment
of a boundary layer, or in separated flows. They play an important role in combustion, where they
determine the flame fronts, and in acoustics, where they are largely responsible for the generation
of noise. They are also found in the atmosphere (cyclonic or anticyclonic perturbations), in the ocean
(mesoscale eddies), and in planetary atmospheres (Jupiter's Great Red Spot or Neptune's Dark
Blue spot). An important characteristic of these coherent structures is their highly mixing and
unpredictable character: for these reasons they are an essential component of the turbulence.

¹ In my definition of the coherent structures, I require also that they should be unpredictable.

² Hereafter called Kelvin-Helmholtz vortices.

More recently, another type of coherent structure has been discovered in free-shear flows,
namely three-dimensional hairpin-shaped* longitudinal vortices (Breidenthal, 1981, Bernal and
Roshko, 1986). These structures seem to play an important role during the transition process to
small-scale turbulence.

The coherent structures were first discovered in experiments. However, the increasingly
fast development of computational fluid dynamics now allows spectacular progress in the
understanding of the role and dynamics of coherent structures in turbulent shear flows.

2  The two-dimensional free-shear layers


2.1  Numerical  methods 

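We solve numerically in a rectangular domain the two-dimensional Navier-Stokes equation

$$
\frac{D\omega}{Dt} = \nu \nabla^2 \omega \qquad (1)
$$

where

$$
\omega = -\nabla^2 \psi(x, y, t) \qquad (2)
$$

is the vorticity, and $\psi$ the stream function of the flow. The $D/Dt$ operator is the Lagrangian
derivative following the fluid motion, given by

$$
\frac{D}{Dt} = \frac{\partial}{\partial t} + J(\cdot, \psi) \qquad (3)
$$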

where $J(A, B) = (\partial A/\partial x)(\partial B/\partial y) - (\partial B/\partial x)(\partial A/\partial y)$ is the Jacobian operator. The initial basic
flow (in a periodic calculation) or inflow (in a spatially-growing calculation) may be a hyperbolic-tangent
velocity profile (mixing layer), a $1/\cosh^2 y$ profile (plane jet) or a Gaussian deficit
velocity profile (wake). To these profiles we superpose initially (in the periodic calculations) or at
each time step (in the spatially-growing calculations) a random perturbation of weak amplitude
(white noise), modulated in the $y$ direction by a Gaussian function of width $\delta$. In fact, this study
concerns both the natural transition to turbulence⁴ and the generation of coherent structures
in a developed shear layer, due to the instability of the mean-velocity profile. Boundary conditions
at the lateral boundaries are of the free-slip type. Numerical methods consist of finite differences on
a regular grid, and are described in Comte et al. (1988) and Comte (1989). The Jacobian in eq. (3)
is calculated using Arakawa's scheme, which conserves kinetic energy and enstrophy in the domain.
In the spatially-developing calculations, the outflow boundary condition is of the Sommerfeld type.

* Hairpin structures are also found in the boundary layer both during the transition to turbulence
(Klebanoff et al., 1962) and in developed turbulence (Kline et al., 1967). They will not be considered
here.
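
The Jacobian operator can be sketched with plain centered differences. This is deliberately not Arakawa's scheme, which averages several such discrete forms to obtain its conservation properties; the sketch only shows the simplest form. A doubly periodic grid is assumed for brevity (the calculations described above use free-slip lateral boundaries), and the sample fields are arbitrary.

```python
import math


def jacobian(a, b, dx, dy):
    """Centered-difference J(A, B) = A_x B_y - B_x A_y on a periodic grid.

    a[i][j] and b[i][j] are values at the point (i*dx, j*dy).
    """
    nx, ny = len(a), len(a[0])
    jac = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        ip, im = (i + 1) % nx, (i - 1) % nx
        for j in range(ny):
            jp, jm = (j + 1) % ny, (j - 1) % ny
            da_dx = (a[ip][j] - a[im][j]) / (2.0 * dx)
            da_dy = (a[i][jp] - a[i][jm]) / (2.0 * dy)
            db_dx = (b[ip][j] - b[im][j]) / (2.0 * dx)
            db_dy = (b[i][jp] - b[i][jm]) / (2.0 * dy)
            jac[i][j] = da_dx * db_dy - db_dx * da_dy
    return jac


n = 16
h = 2.0 * math.pi / n
omega = [[math.sin(i * h) * math.cos(j * h) for j in range(n)] for i in range(n)]
psi = [[math.cos(i * h) + math.sin(2 * j * h) for j in range(n)] for i in range(n)]
advection = jacobian(omega, psi, h, h)      # the term J(omega, psi) in eq. (3)
```

Even this simple form is antisymmetric, $J(A, B) = -J(B, A)$, and sums to zero over a periodic domain; it does not by itself conserve both kinetic energy and enstrophy.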

The diffusion of a passive scalar $\theta$, called here temperature, is studied simultaneously. It satisfies an equation analogous to (1),

$$\frac{D\theta}{Dt} = \kappa \nabla^2 \theta \qquad (4)$$

where $\kappa$ is the molecular diffusivity. The Prandtl number $\nu/\kappa$ will be taken equal to 1. The initial temperature profile is identical to the basic velocity profile, so that the temperature acts as a numerical dye and visualizes the flow structures in the same manner as a dye does in experiments.

Calculations are done here at Reynolds numbers (based on the initial vorticity thickness) of 1000 or 500. Note that, in a previous study, we also performed calculations in which the dissipative term $\nu \nabla^2 \omega$ in (1) was replaced by $-\nu_1 \nabla^4 \omega$ (see Lesieur et al., 1988). This modification is frequently made in two-dimensional turbulence studies related to oceanography and meteorology, and shifts the dissipative effects towards the smallest scales, close to the calculation mesh $\Delta x$. However, as far as the coherent-structure dynamics is concerned, the results are not substantially different at the Reynolds numbers considered here.


2.2  Coherent-structures  dynamics 

Before looking at the results of the calculation, it is of interest to recall the main results of the linear-instability theory applied to periodic free-shear flows. In a fluid of uniform density $\rho_0$, we consider the stability of a parallel flow of components $(\bar u(y), 0, 0)$, upon which is superposed a small perturbation. This perturbation is assumed to be two-dimensional in the $(x, y)$ plane, with a stream function $\tilde\psi(x, y, t)$ of the form

$$\tilde\psi = \epsilon\, \phi(y) \exp i\alpha(x - ct) \qquad (5)$$

corresponding to a perturbed velocity field of components $(\tilde u = \partial\tilde\psi/\partial y,\ \tilde v = -\partial\tilde\psi/\partial x,\ 0)$, with:

$$\tilde u = \epsilon \frac{d\phi}{dy} \exp i\alpha(x - ct), \qquad \tilde v = -i\alpha\epsilon\, \phi(y) \exp i\alpha(x - ct) \qquad (6)$$


$\epsilon \ll 1$ is a small non-dimensional parameter; $\alpha$ is real, and is the spatial longitudinal wave number of the perturbation: this is a temporal analysis, as opposed to a spatial analysis where $\alpha$ is complex. In this temporal study, $c$ is complex, of real and imaginary parts $c_r$ and $c_i$: $c_r$ is the phase speed of the perturbation, while $\alpha c_i$ is its temporal growth rate. Note that, within the linear-instability analysis, it is not necessary to consider the growth of three-dimensional perturbations, which are always less amplified than two-dimensional perturbations (Squire's theorem, see Drazin and Reid, 1981). This is, however, no longer true in compressible flows: for instance, the temporal compressible mixing layer admits, at a Mach number $M_c = U/c > 0.6$ ($U$ is half the velocity difference and $c$ is here the speed of sound), oblique waves which are more unstable than the two-dimensional waves, as shown by Sandham and Reynolds (1989).

⁴ Here the white noise models the residual incoming turbulence in an experimental apparatus, which injects energy in all the unstable modes.


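We assume that the total velocity field (basic flow + perturbation) is two-dimensional (no $z$-dependence). Its vorticity is

$$\omega = -\frac{d\bar u}{dy} + \tilde\omega, \qquad \tilde\omega = -\nabla^2 \tilde\psi \qquad (7)$$

The vorticity equation (1) is then expanded to first order with respect to $\epsilon$. We assume that the basic flow is a solution of the Navier-Stokes equation. One then easily obtains

$$\frac{\partial \tilde\omega}{\partial t} + \bar u(y) \frac{\partial \tilde\omega}{\partial x} - \tilde v\, \frac{d^2 \bar u}{dy^2} = \nu \nabla^2 \tilde\omega \qquad (8)$$

which can be written as (assuming that $\alpha \neq 0$):

$$\left[\bar u(y) - c\right] \left(\frac{d^2 \phi}{dy^2} - \alpha^2 \phi\right) - \frac{d^2 \bar u}{dy^2}\, \phi = \frac{\nu}{i\alpha} \left(\frac{d^2}{dy^2} - \alpha^2\right)^2 \phi \qquad (9)$$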


which is the traditional form of the Orr-Sommerfeld equation. This equation can be solved numerically for various basic velocity profiles: for the hyperbolic-tangent mixing layer, to a given Reynolds number corresponds a range of unstable wave numbers $\alpha$. The amplification rate $\alpha c_i$ is maximum for a certain wave number $\alpha_a$, called the most-amplified mode: it is this mode which will emerge first when the initial perturbation is a white noise, or a random perturbation of broad spatial spectrum. This is a very efficient way of selecting the associated spatial wave length $\lambda_a = 2\pi/\alpha_a$. When the Reynolds number exceeds a value $\approx 30$ (for the mixing layer) or $\approx 100$ (for the jet or the wake), $\lambda_a$ is independent of the Reynolds number, and scales on $\delta_i$, the initial vorticity thickness. Therefore, the vortical layer of width $\delta_i$ will first oscillate, then roll up by vorticity induction, and form either a street of Kelvin-Helmholtz vortices (in the mixing-layer case) or a Karman street (in the wake or jet case).
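The selection of a most-amplified mode can be illustrated numerically. The sketch below is not the method of the paper: it solves the inviscid limit of the Orr-Sommerfeld equation (the Rayleigh equation, $\nu \to 0$) for the profile $\bar u = \tanh y$, using second-order finite differences on a truncated domain with $\phi = 0$ at both ends. The function name, the domain half-height and the grid size are all assumptions chosen for illustration.

```python
import numpy as np


def rayleigh_growth_rate(alpha, half_height=15.0, n=300):
    """Largest temporal growth rate alpha * c_i for u(y) = tanh(y).

    Inviscid sketch: solves [u (D^2 - alpha^2) - u''] phi = c (D^2 - alpha^2) phi
    with phi = 0 at y = +/- half_height (an assumed truncation of the domain).
    """
    y = np.linspace(-half_height, half_height, n + 1)[1:-1]  # interior points
    h = y[1] - y[0]
    m = y.size
    off_diagonal = np.full(m - 1, 1.0)
    d2 = (np.diag(off_diagonal, -1) - 2.0 * np.eye(m) + np.diag(off_diagonal, 1)) / h**2

    u = np.tanh(y)
    u_yy = -2.0 * np.tanh(y) / np.cosh(y) ** 2   # second derivative of tanh(y)

    lap = d2 - alpha**2 * np.eye(m)              # D^2 - alpha^2
    lhs = u[:, None] * lap - np.diag(u_yy)       # u (D^2 - alpha^2) - u''
    c = np.linalg.eigvals(np.linalg.solve(lap, lhs))
    return alpha * c.imag.max()


if __name__ == "__main__":
    for alpha in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
        print(alpha, rayleigh_growth_rate(alpha))
```

Scanning the wave number in this way shows the growth rate passing through a maximum at an intermediate $\alpha$, which is the mode a white-noise perturbation selects. A viscous calculation at finite Reynolds number would require the full fourth-order operator of eq. (9).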

Figure 1 shows the simultaneous evolution of the vorticity and temperature fields in a mixing-layer calculation involving initially 8 fundamental eddies: the eddies, once formed, undergo successive pairings, in which they turn around each other and amalgamate. One can also verify that the temperature wraps around the vorticity concentrations. These images bear a striking resemblance to experimental visualisations of the mixing between two chemically-reacting flows, done by Koochesfahani and Dimotakis (1986). Another important remark concerns the unpredictable character of the pairing interaction, and the fact (when looking at the subsequent evolution of the layer) that eddies of different size will preferentially pair: this may be responsible for structures involving three eddies, when two eddies that are completing their merger pair with a third one.


Figure 2 shows an incompressible spatially growing mixing layer once a statistically stationary regime is established. The visual spreading rate is found to be in good agreement with the experiments on natural⁵ mixing layers. Figure 3 shows the vorticity field in a spatially growing jet.

⁵ that is, unforced

Figure 2: numerical simulation of an incompressible spatially growing mixing layer: passive scalar contours.

We have studied in Staquet et al. (1985) and Lesieur et al. (1988) the longitudinal⁶ spectra of kinetic energy and passive temperature in a two-dimensional periodic mixing layer: it was shown that the kinetic energy spectrum, initially formed of a peak at the fundamental mode $\alpha_a$ and of its harmonics, develops an inertial range at the end of the first pairing. The slope of this range lies between $k^{-3}$ and $k^{-4}$.


Figure 3: numerical simulation of an incompressible spatially growing plane jet: vorticity field (courtesy P. Alexandre, 1989).


3  Two-dimensional  compressible  periodic  mixing  layer


We have developed a finite-difference numerical code which solves the full compressible Navier-Stokes equations. The spatial scheme is a centered second-order scheme, and the temporal one is a third-order Runge-Kutta scheme with reduced storage. Here (see also Normand et al., 1989), this code is applied to a two-dimensional temporal mixing layer of initially uniform temperature. It was shown by Papamoschou and Roshko (1988) that the relevant parameter to describe the spreading of the layer is the convective Mach number, equal to $M_c = (U_1 - U_2)/2c$ in this simplified isothermal case. Calculations show that the Kelvin-Helmholtz instability is inhibited above $M_c = 0.6$: eddies hardly roll up, and they merge without spiralling about each other as they do in the incompressible case. This seems to be related to the sharp transition in the mixing-layer structure found experimentally by Papamoschou and Roshko (1989) at this Mach number. Since, as already stressed, the three-dimensional instabilities grow faster above $M_c = 0.6$, it is necessary to develop three-dimensional computations in order to understand the structure of a turbulent compressible mixing layer at high Mach numbers.

⁶ in the x direction

4  Three-dimensional  direct  and  large-eddy  simulations 


We have used pseudo-spectral numerical methods in order to simulate both the incompressible three-dimensional temporal mixing layer (with free-slip boundary conditions on the upper and lower faces of the computational box) and a gaussian wake⁷ in a periodic box. The Navier-Stokes equation in Fourier space may be written as:

$$\frac{\partial}{\partial t} \hat{\mathbf u}(\mathbf k, t) = \Pi(\mathbf k) \circ F\left[F^{-1}(\hat{\mathbf u}) \times F^{-1}(\hat{\boldsymbol\omega})\right] - \left[\nu + \nu_t(k|k_c)\right] k^2\, \hat{\mathbf u}(\mathbf k, t) \qquad (10)$$

where $\hat{\mathbf u}(\mathbf k, t)$ is the velocity field, $\hat{\boldsymbol\omega}$ the vorticity field, and $\Pi(\mathbf k)$ the projector on the plane perpendicular to $\mathbf k$. $F$ stands for a fast Fourier transform operator. The incompressibility condition reads:

$$\mathbf k \cdot \hat{\mathbf u}(\mathbf k, t) = 0 \qquad (11)$$

The term $\nu_t(k|k_c)$ is a spectral eddy-viscosity, which will be used for large-eddy simulations (see below). It is zero for direct numerical simulations.
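The right-hand side of eq. (10) can be sketched with NumPy's FFT routines. This is a minimal illustration, not the authors' code: it assumes a fully $2\pi$-periodic cube with integer wave numbers, performs no dealiasing and no time stepping, and the names `wavenumbers`, `project` and `rhs` are hypothetical.

```python
import numpy as np

AXES = (1, 2, 3)  # spatial axes of an array shaped (3, n, n, n)


def wavenumbers(n):
    """Integer wave vectors of a 2*pi-periodic cube, shape (3, n, n, n)."""
    k1 = np.fft.fftfreq(n, d=1.0 / n)
    return np.stack(np.meshgrid(k1, k1, k1, indexing="ij"))


def project(k, f_hat):
    """Apply the projector Pi(k): remove the component of f_hat along k."""
    k2 = np.sum(k * k, axis=0)
    k2_safe = np.where(k2 == 0.0, 1.0, k2)   # the k = 0 mode is left unchanged
    return f_hat - k * np.sum(k * f_hat, axis=0) / k2_safe


def rhs(u_hat, k, nu, nu_t=0.0):
    """Right-hand side of eq. (10); nu_t = 0 gives a direct simulation."""
    omega_hat = 1j * np.cross(k, u_hat, axis=0)            # vorticity = i k x u
    u = np.real(np.fft.ifftn(u_hat, axes=AXES))
    omega = np.real(np.fft.ifftn(omega_hat, axes=AXES))
    nonlinear_hat = np.fft.fftn(np.cross(u, omega, axis=0), axes=AXES)
    k2 = np.sum(k * k, axis=0)
    return project(k, nonlinear_hat) - (nu + nu_t) * k2 * u_hat


if __name__ == "__main__":
    n = 16
    k = wavenumbers(n)
    x = 2.0 * np.pi * np.arange(n) / n
    X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
    # Beltrami (ABC) flow: velocity and vorticity are parallel, so the
    # nonlinear term vanishes and only the viscous decay remains.
    u = np.stack([np.sin(Z) + np.cos(Y), np.sin(X) + np.cos(Z), np.sin(Y) + np.cos(X)])
    u_hat = np.fft.fftn(u, axes=AXES)
    r = rhs(u_hat, k, nu=0.01)
    print(np.max(np.abs(r + 0.01 * u_hat)))
```

The projection step is what enforces eq. (11): whatever the nonlinear term produces, its component along $\mathbf k$ is removed, so an initially divergence-free field stays divergence-free.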

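A passive scalar $\hat\theta(\mathbf k, t)$ satisfies the equation

$$\frac{\partial}{\partial t} \hat\theta(\mathbf k, t) = -i\, \mathbf k \cdot F\left[F^{-1}(\hat\theta)\, F^{-1}(\hat{\mathbf u})\right] - \left[\kappa + \kappa_t(k|k_c)\right] k^2\, \hat\theta(\mathbf k, t) \qquad (12)$$

where a spectral eddy-conductivity $\kappa_t(k|k_c)$ will also be used for large-eddy simulations.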

The resolutions are $48^3$ collocation points for the calculations done on the I.M.G. Alliant VFX40 machine, and $128^3$ points on the C.C.V.R. Cray 2 machine. The graphics are done on the Alliant, using the FLOSIAN⁸ software we have developed in Grenoble.

⁷ that is, developing from a perturbed gaussian velocity profile
⁸ FLOw SImulation ANalysis

4.1  Direct-Numerical  simulations


4.1.1  The  mixing  layer 

The calculation domain in physical space is a rectangular parallelepiped of sides $L_x$, $L_y$ and $L_z$. One assumes periodicity in the $x$ and $z$ directions. The resolution is $128^3$ for a calculation involving 4 fundamental eddies ($L_x = 4\lambda_a$). One superposes on the basic flow a three-dimensional isotropic wide-band perturbation of kinetic energy $\langle u'^2 \rangle = 10^{-4} U^2$, whose spectrum peaks at $\alpha_a$. Visualizations of the scalar⁹ show that a hairpin-shaped vortex of spanwise wave length $\lambda = 2\lambda_a$ is formed in the braids, with the same topology as that observed experimentally by Bernal and Roshko (1986). The same structures have been found in the direct numerical simulations involving two eddies done by Metcalfe et al. (1987). The origin of these hairpin coherent structures is still under discussion. A possible explanation, proposed by Lasheras and Choi (1988), is the straining, between two Kelvin-Helmholtz rollers, of vortex filaments perturbed about the stagnation line. We expect to show more about the vorticity fields corresponding to these structures during the conference.

4.1.2  The  plane  wake 

The calculation is done at a resolution of $48^3$ points. The initial flow is a gaussian profile, perturbed by the same perturbation as above. Figure 4 shows the passive scalar during the formation of the Karman street. Later on, longitudinal vortices appear at the outer edge of the Karman eddies. The vorticity contours will be shown at the conference.


Figure 4: passive scalar contour in the three-dimensional wake calculation.


⁹ This work is in progress. Up to now, only colour slides of the scalar field during the pairing of primary vortices are available.

4.2  Large-Eddy  simulations


4.2.1  Spectral  eddy-coefficients

We use the concept of spectral eddy-viscosity and eddy-conductivity in order to model the subgrid scales corresponding to $k > k_c$, $k_c$ being the cutoff wave number (see Lesieur, 1987). Using the non-local interactions theory of isotropic-turbulence two-point closures, it may be shown that the kinetic energy flux between $k$ and the subgrid scales can be represented with the aid of the eddy-viscosity (Kraichnan, 1976; Lesieur and Schertzer, 1978):

$$\nu_t(k|k_c) = \frac{1}{15} \int_{k_c}^{\infty} \theta_{0pp} \left[5E(p) + p \frac{\partial E}{\partial p}\right] dp \qquad (13)$$

for $k \ll k_c$. $E(k)$ is the kinetic energy spectrum, and $\theta_{0pp}$ a time characteristic of the triple-correlations relaxation. In the same way, the eddy-conductivity may be written as

$$\kappa_t(k|k_c) = \frac{2}{3} \int_{k_c}^{\infty} \theta^T_{0pp}\, E(p)\, dp \qquad (14)$$

If $k_c$ lies in a $k^{-5/3}$ Kolmogorov spectrum, it is found:

$$\nu_t(k|k_c) = 0.267 \left[\frac{E(k_c)}{k_c}\right]^{1/2} \qquad (15)$$

$$\kappa_t(k|k_c) = 0.445 \left[\frac{E(k_c)}{k_c}\right]^{1/2} \qquad (16)$$

It is these quantities which are used in eqs. (10) and (12).
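Eqs. (15) and (16) are simple enough to evaluate directly. The short sketch below does so; the function name `eddy_coefficients` and the sample values of $E(k_c)$ and $k_c$ are hypothetical, and only the constants 0.267 and 0.445 come from the text.

```python
import math


def eddy_coefficients(e_kc, k_c):
    """Return (nu_t, kappa_t) from eqs. (15)-(16).

    e_kc is the kinetic energy spectrum at the cutoff wave number k_c,
    assumed to lie in a k^(-5/3) range.
    """
    scale = math.sqrt(e_kc / k_c)
    return 0.267 * scale, 0.445 * scale


if __name__ == "__main__":
    nu_t, kappa_t = eddy_coefficients(e_kc=4.0, k_c=1.0)  # illustrative values only
    print(nu_t, kappa_t, nu_t / kappa_t)
```

Whatever the spectrum level at the cutoff, the ratio of the two coefficients is 0.267/0.445 = 0.6, the constant eddy Prandtl number used in these calculations.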

When the method is applied to isotropic decaying three-dimensional turbulence, kinetic energy spectra close to $k^{-5/3}$ are obtained in the neighbourhood of $k_c$, and the kinetic energy decays like $t^{-1.37}$ (see Lesieur et al., 1989), in good agreement with the predictions of the statistical EDQNM theory (see Lesieur, 1987).

The eddy Prandtl number $\nu_t(k|k_c)/\kappa_t(k|k_c)$, where $\kappa_t(k|k_c)$ is the spectral eddy-diffusivity calculated using the spectral temperature transfers, is, from EDQNM calculations, taken constant and equal to 0.6. In fact, it was recently shown by Lesieur and Rogallo (1989) and Lesieur et al. (1989), on the basis of a direct determination, that it increases with $k$ between the values 0.2 and 0.8. But this variation has no bearing on the following calculations, where the passive temperature is used only as a numerical dye to visualise the coherent structures.

4.2.2  The  mixing  layer

The calculation involves two fundamental eddies, with a resolution of 64 x 32 x 32. Figure 5 shows the interface¹⁰ when the primary instability starts to grow. It indicates that a strong spanwise instability develops as well. This spanwise instability could be due to the straining, between the fundamental eddies, of vortex lines perturbed initially in the spanwise direction by the random perturbation. This would lead to hairpin vortices by the mechanism proposed by Lasheras and Choi (1988), already mentioned.

¹⁰ characterised by the isothermal surface of zero value initially


Figure 5: large-eddy simulation of the mixing layer; scalar field at the beginning of the roll up.


Figure 6 shows the same surface at the end of the roll up. Figure 7 shows, at the same time, the primary vorticity $\omega_z$, corresponding to an iso-value $2U/\delta_i$. Note that this value is the maximum vorticity one can get in two dimensions, where vorticity is only advected by the motion and dissipated by molecular viscosity. On Figure 6, the dark thin longitudinal lines indicate the longitudinal vorticity $\omega_x$, for the same iso-value as $\omega_z$. This indicates that longitudinal streaks of intense vorticity link the billows, as in the above direct numerical simulations.


Figure 6: large-eddy simulation of the mixing layer; scalar field at the end of the roll up; in dark is shown the intense longitudinal vorticity.

Figure 7: large-eddy simulation of the mixing layer; spanwise vorticity (same time as Figure 6).


Finally, Figure 8 shows, during the pairing, the three-dimensional spatial spectra of the passive temperature, the kinetic energy, and the three components of velocity. It indicates how, starting from initial spectra exponentially decaying at high wave numbers, a cascade has developed towards the small scales (with a slope which is fairly close to $k^{-5/3}$). This is in very good agreement with the experimental measurements of these spectra. Finally, once the turbulence in the small scales has developed, the variances of the three velocity components are found to be:

$$\langle u'^2 \rangle = 0.10\, U^2$$

$$\langle v'^2 \rangle = 0.07\, U^2$$

$$\langle w'^2 \rangle = 0.08\, U^2$$

again in fairly good agreement with the experiments.

It thus seems that this spectral large-eddy simulation code is a very good tool for describing the whole transitional process towards developed turbulence, both predicting the right statistics and displaying the correct primary and secondary coherent structures.

Figure 8: large-eddy simulation of the mixing layer; three-dimensional spatial spectra of (from top to bottom) the passive temperature, the kinetic energy, and the three velocity components (longitudinal, transverse and spanwise).

Abstract

When  solving  flow  problems  in  unbounded  domains  it  is  necessary  to  introduce  artificial 
boundaries.  If  the  flow  is  smooth  in  the  far  field  and  there  are  no  significant  viscous 
effects,  it  is  rather  well  known  how  to  construct  boundary  conditions  such  that  accurate
solutions  are  obtained.  However,  sometimes  the  computational  domain  cannot  be  extended 
far  enough.  For  example,  when  computing  the  flow  around  a  solid  body,  the  boundary  may 
intersect  the  wall,  thereby  cutting  through  the  boundary  layer.  In  that  layer  the  gradients 
are  very  large  for  large  Reynolds-numbers,  and  the  usual  linearized  equations  are  no  longer 
valid. 

We  shall  analyze  a  few  model  problems  in  order  to  get  an  understanding  of  the  behaviour  of 
the  solutions  depending  on  the  boundary  conditions.  In  particular,  we  shall  discuss  the 
procedure  of  using  the  inviscid  conditions  as  the  basic  set  and  then  add  viscous  conditions  of 
derivative  type.  In  general  the  errors  introduced  in  this  way  are  small  provided  the  given 
data  at  the  boundaries  are  accurate.  If  such  data  are  not  available,  a  common  procedure  is 
to  extrapolate  all  variables.  We  shall  prove  that  this  in  general  introduces  large  errors. 
However,  in  the  case  with  a  boundary  layer,  the  situation  is  more  favorable.  We  shall 
prove that large gradients along the boundary actually help to keep the error small.

We shall also present a number of numerical experiments which confirm the theoretical results.

Abstract

This paper presents a brief review of the NEC SX-3 Series architecture and software from the viewpoint of NEC's philosophy for the SX-3 development. In particular, the importance of single-processor power is stressed, even though this system supports parallel processing.

1.  Introduction

In April 1989, NEC announced the SX-3 Series, consisting of 7 model configurations, with maximum performance ranging from 1.37 GFlops for the single-processor Model 11 to 22 GFlops for the 4-processor Model 44. The SX-3 is the first Japanese supercomputer to employ a multiprocessor configuration.

2.  SX-3  Development  Philosophy 

The objective of a supercomputer is to offer users the solution to their problems; in other words, to offer the capability to solve large-scale scientific problems at minimal cost. It has to offer the shortest turn-around time or response time for a given problem.

Response time consists of CPU time, I/O time and communication time. These have to be offered in a well-balanced manner; however, the most important one is considered to be CPU time.

Minimal CPU time is realized by an appropriate choice of architecture and device technology. The criteria considered for this choice are as follows:

• The selected architecture has to offer minimal CPU time for a wide range of applications.

• The architecture must also be appropriate from the viewpoint of reliability, ease of use and upgradability.

• The high-speed device technology available in a given period of time, and the balance between the architecture and the device technology, have to be considered.

In the early stage of the SX-3 development, the following points were observed, and decisions were taken in the light of the criteria mentioned above.

• Developing the fastest possible single processor is very important; in particular, the scalar performance of a single processor must never be degraded, even if the cost is considerably high.

• The importance of a powerful scalar processor cannot be overstressed: even if the vectorization or parallelization ratio of a program is, say, over 99%, the total performance is severely degraded if the remaining 1% has to be processed by a slower scalar processor.

• Moreover, there exist many large application programs whose vectorization or parallelization ratio is rather low.

• At present and in the foreseeable future, SIMD-type applications are considered to be dominant over MIMD-type applications; vector processing support therefore has priority over parallel processing support. Note that 'microtasking' is considered to be a kind of vector processing.

• Yet there exist some applications which require MIMD operation, as well as user sites where throughput is important; multiprocessor support is therefore also necessary, as a second priority.

• In the case of a multiprocessor system, a shared-memory architecture is appropriate, from the point of view of ease of use and of continuity with the single-processor system.

• As for device technology, silicon has to remain in use, because devices such as GaAs and JJ are still in their infancy, whereas silicon technology has enough room for further speed enhancement and is stable and economical.
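The argument about the remaining 1% of scalar work in the list above is an instance of Amdahl's law, and a few lines of arithmetic make it concrete. In the sketch below the function names and the numerical values (a vector unit 100 times faster than the reference scalar unit, a scalar unit 4 times slower than the reference) are purely hypothetical; they are not SX-3 figures.

```python
def relative_run_time(vector_fraction, vector_speedup, scalar_slowdown=1.0):
    """Run time relative to an all-scalar run on the reference scalar unit.

    vector_fraction: share of the work that is vectorized (0..1)
    vector_speedup:  how much faster the vector unit executes that share
    scalar_slowdown: how much slower the scalar unit is than the reference
    """
    scalar_part = (1.0 - vector_fraction) * scalar_slowdown
    vector_part = vector_fraction / vector_speedup
    return scalar_part + vector_part


def overall_speedup(vector_fraction, vector_speedup, scalar_slowdown=1.0):
    """Amdahl's law: overall speedup over the all-scalar reference run."""
    return 1.0 / relative_run_time(vector_fraction, vector_speedup, scalar_slowdown)


if __name__ == "__main__":
    # 99% vectorized, hypothetical 100x vector unit, reference scalar unit.
    print(overall_speedup(0.99, 100.0))
    # Same program, but the scalar unit is 4 times slower.
    print(overall_speedup(0.99, 100.0, scalar_slowdown=4.0))
```

With these illustrative numbers the 1% of scalar work already accounts for about half of the run time, and slowing the scalar unit by a factor of 4 cuts the overall speedup from about 50 to about 20.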

The observations described above strongly affected the choice of architecture and technology for the SX-3 development. What has to be stressed here is that NEC strongly favors a system which has the most powerful processors, built with the fastest VLSI technology, and does not support, for general-purpose supercomputers, a system which has many processors built with relatively slower technology.

SX-3  technology 

NEC continues to use silicon VLSI technology to achieve the most powerful scalar as well as vector and parallel processing capability.

Using high-speed silicon bipolar CML (current mode logic) circuitry, the CML-LSI logic elements of the SX-3 have gate delays of 70 picoseconds and contain up to 20,000 gates per chip. The high-speed 40 Kbit RAM chips used in the vector registers contain 7,000 gates of logic and have access times of 1.6 nanoseconds.

Main  memory  chips  are  256  Kbit  Bi-CMOS  static  RAMs  with 
access  times  of  20  ns,  and  the  extended  memory  is 
integrated  with  1  Mbit  MOS  dynamic  RAMs. 

Up  to  100  high  speed  LSIs  (2,000,000  gates)  are 
contained  on  each  22.5  x  22.5  cm  ceramic  package. 

Due to the enormous number of input/output terminals to connect, four signal wire layers were employed on the ceramic board. The density of the chip layout is further augmented with the use of polyimide insulation, which reduces signal delay times, resulting in very fast signal propagation.

SX-3  Architecture 

As mentioned above, our basic approach to realizing a high-speed computer system is to push single-processor performance to its limit, and then to combine these ultra-high-speed processors into a multiprocessor system.

Figure 1 shows a maximum configuration of the SX-3 system, and Table 1 shows the system specifications together with those of the SX-2A, the top-end model of the former SX series. In a maximum configuration, four arithmetic processors share a common main memory unit which has a capacity of up to 2 GBytes. Building the multiprocessor system from a shared memory and a small number of high-speed processors gives users ease of use and an easy programming environment: unlike with a distributed-memory system, they need not care about the memory allocation algorithm, and they need not increase the degree of parallelism to fully utilize the hardware capability.

In Figure 1, the Control Processor (CP) performs I/O management. The Arithmetic Processor (AP), on the other hand, which contains a scalar unit and a vector unit, is an ultra-high-speed Fortran engine and executes all the user codes compiled by the Fortran compiler, as well as major supervisory operations.

The Control Processor Memory (CPM), with a capacity of up to 256 MBytes, is used as a large I/O buffer. The Main Memory Unit (MMU) is a large and fast memory for the execution of user and supervisory programs running on the Arithmetic Processors. To transfer a large amount of vector data quickly and smoothly, the MMU is divided into a maximum of 1,024 independent banks; that is, a 1,024-way interleaved system is employed.
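Why the number of banks matters for vector access can be seen with a small model of interleaving. The sketch below assumes the simplest possible mapping, in which consecutive word addresses fall in consecutive banks (address modulo the number of banks); the text does not state the actual SX-3 mapping, and the function names are hypothetical.

```python
from math import gcd


def bank_of(word_address, n_banks=1024):
    """Bank holding a word under simple low-order interleaving (an assumption)."""
    return word_address % n_banks


def banks_touched(stride, n_banks=1024):
    """Number of distinct banks visited by a constant-stride vector access."""
    return n_banks // gcd(stride, n_banks)


if __name__ == "__main__":
    for stride in (1, 2, 3, 8, 1024):
        visited = {bank_of(i * stride) for i in range(1024)}
        print(stride, banks_touched(stride), len(visited))
```

Under this model a unit or odd stride spreads successive references over all 1,024 banks, whereas a stride that is a large power of two keeps returning to the same few banks, which then have to serve the requests one after another.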

The Extended Memory Unit (XMU) is a large-capacity semiconductor memory unit ranging from 1 GByte to 16 GBytes, and works as a very high-speed virtual disk unit. The XMU, which has a transfer speed of 2.75 GBytes/sec, is used for temporary/permanent disk files, disk cache buffers and job swapping files.

The SX-3 can be configured with up to four I/O processors, with an aggregate transfer speed of 1 GByte/sec. Each I/O processor has up to 64 channels, through which various peripherals such as disk units, cartridge libraries, laser printers and optical disk units are connected.

SX-3 AP Scalar Unit


Each AP Scalar Unit consists of a full complement of 64-bit floating and fixed point arithmetic pipelines. Each AP scalar unit has 128 general-purpose 64-bit registers. The scalar unit issues both scalar and vector instructions. The machine instruction set is based on the RISC (Reduced Instruction Set Computing) concept.

Numeric  Representations 

"Standard  Range"  is  recommended  when  greater  precision  is 
desired;  it  is  compatible  with  IBM  single,  double,  and  quad 
precision  floating  point  representations.  "Extended  Range" 
is  recommended  when  a  large  numeric  range  is  required;  it 
is  compatible  with  CRAY  single  and  double  precision 
floating  point  representations.  The  format  can  be  selected 
at  compile  time  by  use  of  a  compiler  switch. 

Instruction  Chaining 

Full  instruction  chaining  is  supported.  Chaining  is  an 
advanced  form  of  pipelining  which  allows  either  related  or 
unrelated  vector  instructions  to  begin  execution  before 
previous  vector  instructions  complete,  even  though  they 
may  use  the  same  registers  or  pipelines. 

SX-3  Series  Main  Memory 

Main Memory is interleaved up to 1024 ways. 80 Gigabytes/second of concurrent sequential vector load/store, constant-stride vector load/store, vector gather/scatter, scalar cache load, scalar store, I/O, and XMU transfer are supported on the SX-3.

The XMU consists of up to 16 Gigabytes of 1 Megabit DRAM (Dynamic Random Access Memory). Transfer speed to and from Main Memory is 2.75 Gigabytes/second.

Reliability

Traditional  supercomputers  have  been  designed  for  the  sole 
purpose  of  high  speed  execution.  As  a  result,  certain 
reliability  and  error  checking  features,  common  to  lesser 
machines,  have  been  sacrificed. 

Most recent designs have put greater emphasis on system stability and reliability. As an example, reliability of the SX-3 Series is supported by BID (Built-In Diagnostics).

BID is implemented in hardware; its scope includes over 10,000 error indications within the system.

Computational  circuits  are  continuously  monitored  by 
modulo-3  verification  and  dual  circuit  confirmation. 
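The idea behind a modulo-3 check can be illustrated in a few lines. The sketch below shows the general residue-code technique for an integer multiplication; it makes no claim about how the SX-3 hardware implements its verification, and the function names and the `fault_mask` fault-injection parameter are hypothetical.

```python
def residue3(x):
    """Residue of a non-negative integer modulo 3."""
    return x % 3


def checked_multiply(a, b, fault_mask=0):
    """Multiply a and b and verify the result with a modulo-3 residue check.

    fault_mask flips bits of the computed product to emulate a hardware fault.
    Returns (product, check_passed).
    """
    product = (a * b) ^ fault_mask
    # The residue of a correct product equals the product of the residues.
    expected = (residue3(a) * residue3(b)) % 3
    return product, residue3(product) == expected


if __name__ == "__main__":
    print(checked_multiply(123456789, 987654321))
    print(checked_multiply(123456789, 987654321, fault_mask=1 << 17))
```

A single flipped bit changes the result by plus or minus a power of two, which is never a multiple of 3, so every single-bit error is caught by this check. Some multi-bit errors can escape it, which is one reason such a check is usefully combined with a second mechanism such as duplicated circuits.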

If  an  unrecoverable  error  is  detected,  the  faulty  unit  is 
automatically  stopped.  A  hardware  implemented,  automated 
fault  dictionary  is  referenced  immediately  upon  detection 
of  the  unrecoverable  error.  Therefore,  the  maintenance 
engineers  know  which  modules  to  replace  within  seconds  of 
the  error  detection. 

The  SUPER-UX  Operating  System 

The primary operating system for all models of the SX-3 Series is SUPER-UX, a supercomputer-enhanced operating system based on AT&T System V UNIX. Some of the major extensions added to support the requirements of supercomputing include the (a) Intelligent I/O Accelerator Subsystem, (b) Supercomputer File System, and (c) BATCH processing facilities.

Commitments for ongoing SUPER-UX development are to maintain compatibility with AT&T UNIX, IEEE 1003.1 (POSIX), and relevant future standards as they are adopted, and to continue enhancing both performance and usability for the supercomputer community.

IAS  (Intelligent  I/O  Accelerator  System) 

IAS is an operating system extension which intelligently buffers I/O data through a series of multi-level data caches. Sophisticated algorithms eliminate data thrashing while providing automated, user-independent, high-performance I/O, including transparent parallel I/O.

SFS  (Supercomputer  File  System) 

SFS  is  a  file  system  extension  which  is  optimized  for  large 
scientific  data  files.  This  compares  to  the  standard 
System  5  file  system  (S5FS)  which  is  designed  for  rather 
small  data  quantities,  such  as  programs  and  shell  scripts. 
SFS  allocates  file  space  in  units  of  disk  tracks  or 
cylinders; S5FS allocates by sector units (512 bytes). Both
file  systems  are  fully,  transparently  supported  within 
SUPER-UX. 


BATCH  Processing  and  Networks 

One  of  the  shortcomings  of  UNIX  is  the  lack  of  batch 
processing  capability.  Since  one  of  the  primary 
environments  of  supercomputing  is  the  batch  workload,  an 
implementation  of  NQS  has  been  ported  and  coupled  with 
enhanced  scheduling  and  job  control  functionality. 

A  distributed  production  environment  is  further  supported 
by  NFS  and  TCP/IP  networking  capability,  the  latter 
including  the  telnet  and  ftp  facilities.  Future  extensions 
will  include  OSI  and  FDDI  networking  products. 

7. Compilers

FORTRAN  is  the  primary  programming  language  in  scientific 
computing,  and  as  such,  is  the  basis  of  providing  automated 
vector  and  parallel  processing  capabilities.  The  FORTRAN 
compiler, FORTRAN77/SX, is based on the advanced
vectorizing compiler used for the SX-2A. Enhancements are
added to provide state-of-the-art automated parallel processing.
Vectorization  features  include  loop  collapsing,  automatic 
statement  interchange,  index  migration,  automatic  inlining, 
and  automatic  loop  unrolling.  Automated  parallel  processing 
is  initially  implemented  by  microtasking  techniques. 

The  C  compiler,  C/SX,  will  feature  automatic  vectorization 
and  automated  inlining.  It  will  be  suitable  for 
applications  development  as  well  as  systems  and  utility 
generation. 

8.  Summary 

The SX-3 is the first Japanese machine able to use
parallel processing either to increase the overall throughput
of multiple jobs or to reduce the turnaround time of a single
job. However, NEC's basic design approach for high-speed
processing  is  to  pursue  the  ultimate  scalar  performance 
of  a  single  processor  as  well  as  vector  processing 
capability. 

It should also be noted that progress in silicon VLSI and
high-density packaging technologies was one of the major
technological advances that contributed to the realization
of this type of multiprocessor system.

Fig. 2 SX-3 SERIES Arithmetic Processor Configuration


LARGE  SCALE  APPLICATIONS  OF  TRANSPUTERS  : 
ACHIEVEMENT  AND  PERSPECTIVE 

D.J.  Wallace, 

Physics  Department,  University  of  Edinburgh 

Abstract 

This  paper  gives  an  overview  of  large  scale  applications  of  transputers  in  the  context  of  the 
Edinburgh  Concurrent  Supercomputer  Project.  This  is  built  around  a  Meiko  Computing 
Surface,  with  presently  some  400  floating-point  transputers  and  1.7  Gbytes  of  memory.  The 
first  part  of  the  paper  gives  an  overview  of  the  Project’s  origins  and  status  and  describes 
experience  gained  in  providing  a  multi-user  service.  The  second  part  gives  examples  of 
applications  which  are  able  to  exploit  effectively  this  processing  power.  Tools  which  facilitate 
the  use  of  the  machine  for  large  scale  computation  and  visualisation  are  also  briefly  described. 


1  Project  Origins 

Work  at  Edinburgh  on  the  use  of  parallel  computers  for  Physics  applications  began  in  1980, 
on the ICL Distributed Array Processor at Queen Mary College. This initiative was spearheaded
by Stuart Pawley, whose main interest was in molecular dynamics, but the work
rapidly  expanded  into  lattice  field  theory  as  we  appreciated  the  potential  of  this  fine-grain 
SIMD  machine.  We  were  successful  in  acquiring  a  dedicated  machine  at  Edinburgh  in  1982, 
through  funding  from  the  Science  and  Engineering  Research  Council  and  an  agreement  for 
software development between ICL and the Edinburgh University Computing Service; a second
DAP was donated by ICL in 1983. The machines had to be decommissioned in 1987, on
the replacement of the ICL hosts which acted as the University mainframe resource. At that
time some 180 publications had resulted from the work spanning many application areas; a
summary of the work can be found in [1].

The anticipated replacement of the ICL host machines, and the now large community dependent
on high performance computing, obliged us to look for alternative sources of such computing. We
were  convinced  then  that  the  only  way  we  would  have  access  to  the  required  power  with  the 
budgets  we  might  expect  was  through  exploiting  novel  architecture  parallel  machines.  We 
already  had  a  relationship  with  Inmos  who  were  supporting  a  student  in  parallel  computing 
for  high  energy  physics,  and  were  fortunate  to  obtain  one  of  the  earliest  Meiko  Computing 
Surfaces  in  April  1986,  with  the  support  of  the  Department  of  Trade  and  Industry  and  the 
Computer Board. This demonstrator system consisted of 40 T414 transputers each with 1/4
Mbyte, along with a display system, and was file served and networked through a microVAX
host. The reliability of this system, the imminent loss of the DAPs and a survey of existing
parallel machines formed the cornerstone of the proposal for the Edinburgh Concurrent Supercomputer
(ECS). The proposal, in collaboration with Meiko, sought some £3.4M from
the  SERC,  DTI  and  Computer  Board  to  fund  a  machine  built  around  1024  T800  transputers 
(with  floating-point  capability)  each  with  1  Mbyte  of  memory,  to  provide  an  electronically 
reconfigurable  multi-user  resource.  After  some  three  months  the  three  funding  agencies 
agreed in principle to consider joint funding of the machine. Phase 1 funding for the machine
infrastructure and compute resource of 200 T800s, each with 4 Mbytes, was secured
during 1987, multi-user service for code development was established in September 1987,
and the first T800 compute resource installed at the end of that year.

Department of Trade and Industry                             £1144K
Computer Board                                               £575K
SERC (Science and Nuclear Physics Boards)                    £400K
Meiko Limited (excl. personnel)                              £779K
Scottish Development Agency                                  £12K
Industrial Affiliation etc. (cash and kind)                  £800K
Edinburgh Univ. 3 Comp. Officers, infrastr., plus cash of    £102K

Table 1: Funding support for the ECS as at December 1989


2  Present  Status 

2.1  Funding 

The present level of commitment to the Project is shown in Table 1. The initial DTI contribution
funded the infrastructure for the machine, including cabinets, inter-cabinet link
boards, 32 host boards for code development, 4 display systems and some 5 Gbytes of memory.
A second phase of DTI support has enabled us to expand the compute resource with
another 195 T800s and 900 Mbytes of memory. The Computer Board and SERC support
includes some £600K for hardware, which provided the Phase 1 T800 compute resource,
and a frame grabber and fast I/O system. Their contributions also contain some funds for
software, maintenance, and two to three people for up to five years. Meiko have also contributed
very significantly in discount, maintenance and software; in addition they site two
people at Edinburgh and have considerable in-house software activity to meet the Project
requirements. The University has committed £102K in cash as well as three computing officers
for service management and support. At the time of writing (December 1989), there are
a total of 15 personnel in the core of the Project (i.e. excluding specific application teams);
Meiko support and the use of the income from the industrial affiliation scheme in supporting
10 of these people have been crucial factors in the successful launch of the Project. It is
anticipated that substantial recurrent funding from the Science and Engineering Research
Council and from the Computer Board will enable the establishment of a parallel computing
centre early in 1990 incorporating work on two AMT DAPs and departmental shared
memory resources, as well as the ECS Project.

2.2  Hardware  Configuration 

General  information  about  the  transputer,  the  Computing  Surface  and  the  occam  language 
can  be  found  in  [2].  The  machine  organisation  is  shown  schematically  in  figure  1.  The 
computational  model  is  that  of  a  network  of  workstations  and  file  servers.  It  is  realised  by 
a  transparent  communication  spine  which  is  also  based  on  transputers  and  off  which  hang 
the  various  resources  of  the  machine.  The  microVAX  host  of  the  original  demonstrator 
system  is  now  retained  as  one  of  the  file  servers,  and  to  provide  a  VMS  environment.  The 
user can also file serve off a number of Hewlett-Packard disks running a Berkeley 4.2 BSD
file system. Access is provided by direct lines and the facility is networked with Ethernet
and X.25 connections. To provide multi-user capability the compute resource is divided
into domains; the size and number of these is controlled by the system manager, and may
be changed (exploiting the software reconfiguration) for example to match day and night
time user needs. At present there are typically some 30 domains on the service machine; a
separate machine for system development can support a further 8 users. At login, the user
is presented with a menu of domains; she boots an available domain and connects to a file
server and has then a personal parallel machine. Back-up is provided by Exabyte cartridges.

Figure 1: Schematic diagram of Computing Surface organisation


2.3  Operational  Aspects 

Experience  at  Edinburgh  in  the  offering  of  a  multi-user  service  on  novel  architectures  is  based 
on  the  two  mainframe-hosted  DAPs  between  1982  and  1987  and  on  the  ECS  since  1987. 
Although  these  machines  are  of  very  different  architectures,  the  operational  problems  of 
supplying  and  maintaining  a  multi-user  service  are  very  similar.  The  problems  are  principally 
of  two  distinct  types:  those  of  actual  day-to-day  supply  of  the  service,  and  those  of  providing 
effective  support  to  the  users  of  the  service. 

The  first  class  of  problems  is  mainly  concerned  with  the  fair  allocation  of  machine  resource. 
Such  problems  as  the  amounts  of  on-line  and  off-line  file  space  can  be  critical  as  all  the 
experience  gained  shows  that  the  amounts  of  disk  requirements  may  be  phenomenal  both 
in total size and in their rate of growth. The gift of disks under the Hewlett-Packard
Affiliation  agreement  has  therefore  been  particularly  valuable.  Another  major  problem  is 
that  of  ensuring  access  for  long  runs,  which  are  frequently  required  and  can  prove  to  be  a 
scheduling  problem. 

The  experience  at  Edinburgh  is  that  there  is  almost  always  pressure  on  the  facility,  no  matter 
how  much  parallel  resource  is  available.  In  the  ECS  we  have  two  opposing  requirements:  to 
partition  the  resource  so  that  many  users  may  access  it  at  the  same  time,  and  to  provide 
the largest domains possible for a smaller number of users for long production runs.

The second class of problems concerns offering a reasonable level of user support. Because of
the  different  nature  of  the  coding  strategy  and  structure,  typically  a  very  high  level  of  specific 
expertise  is  required  of  any  user  support  staff.  At  this  stage  in  parallel  computing  software 
development,  such  staff  have  to  be  specialists,  while  traditionally  user  support  personnel  in 
University  Computing  Centres  have  had  a  more  widely  based  and  general  background.  This 
problem  will  of  course  become  less  acute  as  parallel  programming  environments  improve. 


3  System  and  Utilities 

A  stand-alone  multi-user  system  MMVCS  has  been  in  use  since  September  1987.  It  provided 
the user with the Occam programming system, OPS, and UNIX¹ file serving utilities. The full
UNIX-like  system  MEiKOS  was  released  into  the  Service  Machine  in  October.  Among  the 
work  performed  on  the  development  machine  is  a  porting  of  communications  software  and  the 
AT&T  System  V.3  utilities.  There  are  C  and  Fortran  compilers  for  single  transputers  and  a 
range  of  utilities  is  available  or  under  development  at  Meiko;  Meiko’s  parallel  programming 
environment  for  Fortran  and  C  code  with  message  passing,  CS-tools,  has  been  used  in  a 
number  of  projects  over  the  summer.  Standard  packages  which  have  been  ported  include 
GKS. 

There  has  been  considerable  effort  at  Edinburgh  to  develop  utilities  which  provide  greater 
flexibility  and  ease  of  porting  of  codes  to  the  Computing  Surface.  The  cornerstone  of  this 
effort  is  the  development  of  fast  topology  independent  adaptive  message  passing  systems 
[3,4].  The  utility,  called  Tiny,  explores  the  transputer  configuration  at  run-time  and  sets 
up  point-to-point  communications  and  broadcasts.  Code  does  not  have  to  be  recompiled  to 
run  on  different  configurations.  The  harness  also  has  fault  tolerant  capabilities;  provided 
a  booting  path  is  available  through  each  transputer,  efficient  routing  between  pairs  will  be 
set  up  even  if  some  of  the  links  are  defective  -  they  will  simply  not  be  utilised  in  setting 
up  the  routing  tables.  The  utility  can  be  called  from  Fortran  or  C  as  well  as  used  in 
an  Occam  program.  Various  flags  permit  the  user  to  specify  whether  data  must  arrive  in 
the same order as it was sent etc., and the size of the buffers can be varied to match the
application  requirements.  The  system  has  particularly  good  characteristics  under  heavy 
load;  for  example  if  messages  arriving  on  link  0  cannot  be  forwarded  on  link  1  because  the 
latter  is  blocked,  other  messages  arriving  on  link  0  can  be  passed  on  by  links  2  or  3  if  these 
are  available. 
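
The idea of exploring the configuration and building routing tables that simply ignore defective links can be sketched as follows. This is not the Tiny implementation, whose internals are described in [3,4]; it is a minimal shortest-path illustration, and the `wiring` layout, the function names and the four-node ring are assumptions made for the example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in adj[node].values():
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def build_routing_tables(wiring, defective=()):
    """wiring[node][link] = peer. Returns tables[node][dest] = outgoing link."""
    bad = {frozenset(pair) for pair in defective}
    # Defective links are dropped before any routes are computed.
    adj = {node: {link: peer for link, peer in ports.items()
                  if frozenset((node, peer)) not in bad}
           for node, ports in wiring.items()}
    tables = {node: {} for node in adj}
    for dest in adj:
        dist = bfs_distances(adj, dest)
        for node in adj:
            if node == dest or node not in dist:
                continue
            # Forward on the first usable link that brings us one hop closer.
            for link, peer in sorted(adj[node].items()):
                if dist.get(peer) == dist[node] - 1:
                    tables[node][dest] = link
                    break
    return tables


def route(tables, wiring, src, dest):
    """Follow the tables hop by hop and return the list of nodes visited."""
    path = [src]
    while path[-1] != dest:
        link = tables[path[-1]][dest]
        path.append(wiring[path[-1]][link])
    return path


# Hypothetical ring of four nodes: link 0 goes clockwise, link 1 anticlockwise.
RING = {i: {0: (i + 1) % 4, 1: (i - 1) % 4} for i in range(4)}
```

With all links healthy, node 0 reaches node 1 directly. If the link between them is marked defective, the tables route the message the long way round the ring, with no change to the calling code.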

A  recent  major  development  has  been  the  extension  of  this  utility  to  provide  deadlock- 
free  communication.  The  approach  is  based  on  recursively  casting  spanning  trees  in  the 
processor  graph,  followed  by  a  reconstructive  phase  which  repairs  excessive  damage.  The 
method  finds  the  shortest  distance  solutions  for  regular  graphs  such  as  grids,  and  for  irregular 
topologies  the  mean  interprocessor  communication  distance  is  only  modestly  increased,  for 
example  by  around  25%  for  a  random  transputer  graph  of  256  nodes.  It  should  be  said  that 
in  practice  the  original  Tiny  has  only  rarely  been  known  to  deadlock  unless  the  user  has 
written  an  incorrect  program  or  insufficient  buffers  have  been  provided;  the  deadlock-free 
Tiny  is  important  however  both  as  a  matter  of  principle  and  for  safety-critical  applications. 

¹ UNIX is a trade mark of AT&T Bell Laboratories

The start-up latency for the utility is 17 microseconds on both the send and receive processors,
and the through-routing CPU overhead time is 19 microseconds per node, with
a realised bandwidth of around 1.4 Mbytes per second per link. These figures compare
favourably with the iPSC/2 for communication over up to three links, particularly when one
bears in mind that we are comparing hardware and software through-routing capability, and
that they refer to a lightly loaded network.
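
To give a feel for these figures, they can be combined into a rough estimate of the time to deliver one message. The text does not say how the terms combine, so the additive model below is an assumption: start-up cost at each end, the through-routing overhead at every intermediate node, and the transfer time at the realised link bandwidth (taking 1 Mbyte as 10^6 bytes).

```python
STARTUP_US = 17.0        # start-up latency on the sending and on the receiving processor
THROUGH_ROUTE_US = 19.0  # CPU overhead at each intermediate node
BANDWIDTH_MB_S = 1.4     # realised bandwidth per link


def message_time_us(nbytes: int, hops: int) -> float:
    """Estimated delivery time in microseconds over `hops` links (assumed model)."""
    intermediate_nodes = max(hops - 1, 0)
    transfer = nbytes / BANDWIDTH_MB_S  # bytes / (bytes per microsecond)
    return 2 * STARTUP_US + intermediate_nodes * THROUGH_ROUTE_US + transfer
```

Under this model a 1400-byte message sent over three links would take about 34 + 38 + 1000 microseconds, so for messages of this size the link bandwidth dominates the software overheads.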

From  the  user  point  of  view,  utilities  like  Tiny  are  important  because  they  assist  flexible  and 
portable code development. It is already the basis for a number of other topology independent
utilities, for example for task farming and 3-d graphics. It has also been used to explore
the  properties  of  irregular  graphs  [5]  which  have  many  attractive  features,  including  mean 
interprocessor  distance  and  diameter  which  increase  only  logarithmically  with  the  number 
of  (fixed-valency)  nodes,  and  are  very  close  to  the  optimal  bound.  Such  graphs  provide  a 
framework  for  shared  memory  emulation  on  distributed  memory  machines,  following  the 
work  of  Valiant  [6]. 

Although  the  implementation  does  not  utilise  random  graphs,  it  should  also  be  mentioned 
here  that  Linda  has  also  been  implemented  on  the  ECS  [7].  The  kernel  has  recently  been 
rewritten  in  C  using  Tiny  [8]  and  tools  will  soon  be  in  place  to  allow  C-Linda  programs  to 
be  developed  and  run  on  any  topology;  it  is  also  intended  that  a  Prolog  interpreter  with 
added  predicates  will  be  ported  to  provide  a  Prolog-Linda  implementation. 


4  Node  Performance 


Reasonable performance on a single node is obviously an important prerequisite for supercomputer
performance across an array. We summarise here experience gained in a range of
applications. 

For  well-structured  Fortran  or  C  code  which  is  floating-point  intensive,  benchmarks  for  single 
precision  give  up  to  0.6  or  0.7  Mflops  per  node.  A  number  of  applications  written  in  occam 
are running at in excess of 1 Mflops per node. Some comparisons relative to workstations
are  given  in  the  examples  of  applications  below. 

To  achieve  maximum  performance  with  minimum  effort,  BLAS1  routines  have  been  written 
in  assembler  for  a  single  T800  [D.  Roweth  and  L.J.  Clarke,  unpublished].  The  table  below 
illustrates  the  performance  obtained  in  these  routines. 


routine   Mflops     routine   Mflops
saxpy     1.17       daxpy     0.72
sdot      1.17       ddot      0.78
snorm     1.58       dnorm     1.05
sscal     0.78       dscal     0.49
ssum      1.35       dsum      1.03
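
For readers unfamiliar with these routines, the sketch below shows what two of them compute and how a rate in the table translates into run time. It is plain Python for illustration, not the T800 assembler. The count of two floating-point operations per vector element is the usual convention and is assumed here, since the text does not state how the figures were counted.

```python
def saxpy(alpha, x, y):
    """Return alpha*x + y element by element (one multiply and one add each)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def sdot(x, y):
    """Return the inner product of x and y (one multiply and one add each)."""
    return sum(xi * yi for xi, yi in zip(x, y))


def seconds_at(mflops: float, n: int, flops_per_element: int = 2) -> float:
    """Time to process vectors of length n at a sustained rate in Mflops."""
    return n * flops_per_element / (mflops * 1e6)
```

At the tabulated 1.17 Mflops, a saxpy on vectors of one million elements would take roughly 1.7 seconds on a single node.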

5  Applications 

There  are  currently  over  300  registered  users  of  the  facility.  The  second  edition  of  the 
Project  Directory  [9]  gives  a  snapshot  of  around  100  projects  under  way  as  of  September 
1989.  The  following  sections  therefore  contain  only  a  few  examples  of  some  of  these  projects, 
to  illustrate  areas  where  it  would  not  have  been  possible  to  do  the  work  without  the  ECS 
facility.  The  topics  fall  into  three  broad  categories:  visualisation  and  image  processing, 
simulation  and  control,  and  interactive  design. 

5.1 Visualisation and Image Processing

The  four  display  systems  and  frame  grabber  facility  support  a  wide  range  of  activities  in 
these  areas,  including  constructive  solid  geometry,  and  NMR  and  other  medical  imaging 
techniques.  The  existence  of  the  3-d  graphics  utility  will  further  expand  this  work.  We 
focus  brief  comments  on  two  specific  examples. 


5.1.1  Radiosity 

Radiosity can produce 3-d visualisation of very high quality. The method is based on (i)
dividing  the  surfaces  in  the  scene  into  N  patches  and  determining  line  of  sight  visibility 
between  pairs  of  patches,  (ii)  solving  the  N  simultaneous  equations  for  the  brightness  of 
each  patch,  and  (iii)  rendering  the  scene  using  z-buffering  or  ray-casting.  Rendering  the 
scene  from  different  view-points  requires  only  stage  three,  and  changing  the  lighting  only 
stages  two  and  three. 

This  algorithm  has  been  implemented  at  Edinburgh  to  run  on  an  array  of  arbitrary  size 
[10].  The  natural  parallelism  in  the  calculations  of  the  form-factors  (the  coefficients  relating 
the  contribution  of  the  illumination  of  one  patch  due  to  the  light  from  another)  and  in  the 
iteration  of  the  matrix  of  these  form  factors  to  determine  the  solution  enables  the  method 
to  run  effectively  on  hundreds  of  transputers. 
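
Stage (ii) can be sketched as a simple iteration on the form-factor matrix. The code below is not the implementation of [10]. It assumes the standard radiosity equation B_i = E_i + rho_i * sum_j F_ij B_j and a plain Jacobi iteration; the names `emission`, `reflectance` and `form_factors` are chosen for the example.

```python
def solve_radiosity(emission, reflectance, form_factors, iterations=50):
    """Iterate B_i = E_i + rho_i * sum_j F_ij * B_j and return the brightnesses."""
    n = len(emission)
    brightness = list(emission)
    for _ in range(iterations):
        # Each patch is updated from the previous iterate only, so the n
        # row sums are independent and could be shared among processors.
        brightness = [
            emission[i]
            + reflectance[i] * sum(form_factors[i][j] * brightness[j] for j in range(n))
            for i in range(n)
        ]
    return brightness
```

For two facing patches with form factors F_12 = F_21 = 1, reflectance 0.5 and unit emission from the first patch only, the iteration converges to brightnesses 4/3 and 2/3. Changing the lighting alters only `emission`, which is why stage (i) need not be repeated.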


5.1.2  Fractal  Landscapes 

Fractals  are  well  known  to  produce  realistic  and  spectacular  landscapes,  and  some  of  the 
larger Hollywood special effects studios such as Lucasfilm are starting to use them in films.
The  conventional  approach  constructs  the  surface  from  random  numbers  starting  from  a 
plane. 

An  undergraduate  project  in  Computer  Science  [M.  White,  unpublished]  was  undertaken 
to  extend  this  method  to  construct  fractal  planets.  The  project  was  originally  conceived 
as  a  serial  program  to  run  on  a  SUN  workstation.  However,  as  the  student  discovered,  the 
method has enormous computing requirements. To obtain results, a last minute decision had
to  be  made  to  move  to  the  ECS.  The  porting  took  two  weeks.  The  resulting  C  code  ran  20 
times  faster  on  one  transputer  than  on  a  SUN  3/60  (without  floating  point  support),  and 
the  level  of  natural  parallelism  in  the  ray-casting  phase  enabled  the  largest  domain  available 
on  the  ECS  (130  T800s)  to  be  used  efficiently.  The  most  complicated  picture  took  one  hour 
on  this  domain,  and  would  have  taken  more  than  2500  hours  (more  than  111  days)  on  the 
SUN.  Would-be  space  travellers  will  be  interested  to  know  that  these  hospitable  planets  with 
their  oceans  and  continents  can  be  visited  and  explored  on  the  screen. 


5.2  Simulation  and  Control 

Simulation  in  scientific  and  engineering  problems  offers  great  scope  for  parallelism.  The 
use of data-parallelism (geometric decomposition) is rather well understood now (see for
example  [11]  and  references  therein),  in  particular  the  conditions  under  which  the  application 
can  be  scaled  up  to  run  at  the  same  speed  on  arbitrary  numbers  of  processors.  Simply 
stated,  it  is  sufficient  that  the  dimension  of  the  connectivity  of  the  computer  be  at  least 
as  large  as  the  spatial  dimension  of  the  problem  -  usually  two  or  three.  Thus  it  would 
be theoretically advantageous to have at least six links on transputers to ensure scaling in
three-dimensional  problems,  although  we  have  not  encountered  serious  problems  in  practice 
with  present  hardware  limitations  (and  hardware  through-routing  seems  in  any  event  a  more 
important  target).  We  focus  here  on  four  examples. 
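
Geometric decomposition can be illustrated in one dimension. In the sketch below, which is an illustration rather than code from any ECS application, a line of grid points is relaxed by replacing each interior value with the mean of its neighbours. The line is cut into equal chunks, one per processor, and each chunk needs only a single boundary value from each neighbouring chunk per sweep. The function names are hypothetical, and the chunks are processed in a loop rather than on separate processors.

```python
def relax(u):
    """One Jacobi sweep: interior points become the mean of their two neighbours."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)] + [u[-1]]


def split(u, nproc):
    """Cut the grid into nproc equal chunks (len(u) is assumed divisible by nproc)."""
    size = len(u) // nproc
    return [u[p * size:(p + 1) * size] for p in range(nproc)]


def relax_decomposed(chunks):
    """One sweep on every chunk, using one halo value from each neighbouring chunk."""
    swept_chunks = []
    for p, chunk in enumerate(chunks):
        has_left = p > 0
        has_right = p < len(chunks) - 1
        # The halo values are the only data that would cross a link.
        padded = ([chunks[p - 1][-1]] if has_left else []) + chunk \
            + ([chunks[p + 1][0]] if has_right else [])
        swept = relax(padded)
        start = 1 if has_left else 0
        swept_chunks.append(swept[start:start + len(chunk)])
    return swept_chunks
```

The work per processor grows with the chunk length while the communication per sweep stays fixed, which is the reason such applications scale when the machine's connectivity matches the dimension of the problem.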

5.2.1  Lattice  Field  Theory 

Applications  in  High  Energy  Physics  are  outwith  the  concerns  of  this  Meeting,  but  some 
comments  are  in  order  here,  because  they  are  traditionally  amongst  the  first  to  be  mounted 
on  high-performance  parallel  machines,  and  although  they  are  not  communications-intensive, 
they  provide  a  more  relevant  and  convincing  benchmark  than  the  Mandelbrot  set.  In  brief 
they  involve  simulations  on  a  four-dimensional  lattice  (approximating  space  and  time),  in 
which  the  variables  are  the  elementary  particles  of  the  theory  (quarks,  gluons,  electrons  etc.). 
Because  the  spin-half  particles  are  Fermions  and  obey  the  Pauli  exclusion  principle,  special 
algorithms  must  be  developed.  These  algorithms  typically  involve  a  sparse  matrix  inversion 
at  each  time-step  of  the  simulation.  This  matrix  has  as  one  of  its  indices  the  space-time 
lattice points; on a 16^4 lattice this implies a 65000 x 65000 matrix. Effort at Edinburgh (for
a  review  see  [12]  and  references  therein)  has  been  focused  on  the  phase  transitions  resulting 
from  the  interactions  of  Fermionic  particles,  and  on  the  possibility  of  models  unifying  the 
known  fundamental  forces  without  the  need  for  the  elusive  Higgs  scalar  particle.  These 
codes can be run on any size of domain; on 130 T800s they deliver more than 100 Mflops.

5.2.2  High  Temperature  Superconductors 

Superconductivity  is  another  area  of  science  where  Fermions  (electrons)  play  a  central  role. 
This  field  has  returned  to  prominence  with  the  discovery  of  materials  that  superconduct 
above  liquid  nitrogen  temperatures.  The  properties  of  these  materials  are  not  yet  well 
understood,  and  computer  simulation  is  potentially  an  important  tool. 

Jones  and  Yeung  at  Queen  Mary  College  have  implemented  a  parallel  variational  Monte 
Carlo  method  to  study  a  model  of  these  materials  which  may  be  both  superconducting 
and magnetic [13]; the method was developed and tested on a small local machine and
benchmarking and production runs were done on the ECS. The algorithm achieves near-linear
scale-up which indicates that it will run with acceptable efficiencies on arrays of hundreds or
more T800s. To date the program has been successfully run on a 22 x 22 lattice, probably
the  largest  of  any  attempted  so  far.  This  implementation  on  transputers  makes  possible  the 
study  of  the  physics  for  a  range  of  practically  interesting  parameters. 

5.2.3  Neural  Network  Models 

The  ‘wetware’  components  in  the  nervous  system  have  typical  timescales  of  the  order  of 
milliseconds, in contrast to the nanoseconds of semiconductor hardware, and our remarkable
neural processing capability is due in part at least to its massive parallelism. Most
neural network models reflect this parallelism and are amenable to study on parallel computers.
Activity on the ECS covers many aspects, including general pattern processing and
optimisation. 

The  most  widely  studied  network  for  practical  applications  is  the  layered  network,  which 
is  trained  by  error  correction  methods  (back-propagation)  so  that  the  desired  processing  is 
achieved for given input data with known outputs (for example, in medical diagnosis, the
input  could  be  symptoms  and  the  output  likely  diagnosis  and  suggested  treatment).  The  key 
idea  is  that  the  net  should  capture  the  intrinsic  correlations  of  the  data,  so  that  its  capability 
generalises  to  new  data  in  the  domain  in  which  it  has  been  trained.  A  simulator  has  been 
developed  on  the  ECS  [14]  which  enables  the  user  to  specify  the  number  of  layers  in  the 
network,  the  number  of  nodes  in  each  layer,  the  nature  of  the  connectivity  between  layers, 
etc.  This  is  then  automatically  mapped  down  on  to  the  transputer  array.  The  simulator  has 
been  used  at  Edinburgh  for  studies  of  content  addressable  memory  and  storage  properties, 
for  exploring  training  strategies,  for  protein  secondary  structure  prediction  and  for  texture 
discrimination.  In  the  last  case,  the  simulator  runs  at  about  10  Mflops  on  an  array  of  17 
transputers; on larger problems, the performance should scale up. External use includes
applications in the oil and defence industries, and in assessment for credit scoring in the
banking industry.

Neural  networks  also  provide  a  framework  for  tackling  optimisation  problems.  An  example 
of  current  interest  includes  the  use  of  the  method  of  analogue  neurons  for  image  restoration 
in  the  framework  of  Geman  and  Geman.  Interest  in  optimisation  problems  has  evolved  also 
into  the  study  of  genetic  algorithms,  in  particular  for  the  optimisation  of  transputer  array 
topology  for  some  application.  For  further  examples  and  references  on  ECS  work  in  these 
areas  see  [15]. 

5.2.4  Chemical  Process  Simulation  and  Control 

In  most  chemical  plants  and  all  oil  refineries  distillation  is  an  important  operation.  Much 
attention  is  devoted  to  controlling  distillation  columns  efficiently,  since  these  are  one  of  the 
largest  energy  sinks  in  the  process.  Problems  affecting  efficient  control  include:  columns 
consisting  of  many  stages  which  may  be  slow  to  respond  to  feedback  control  actions;  and 
stringent  specifications  on  final  product  purity.  If  computer  simulation  of  the  process  is  fast 
enough,  an  efficient  control  plan  can  be  designed  and  implemented  to  keep  the  column  at 
the  desired  production  specifications. 

The  implementation  on  the  ECS  involves  a  chain  of  transputers  for  each  column,  each 
transputer  being  responsible  for  a  module  or  plate  in  the  column.  In  the  simulation,  the 
dynamic  evolution  of  the  column’s  state  is  interactively  displayed.  Two  control  policies 
have  been  examined:  feedforward  control  with  change  of  composition  of  feed;  and  product 
changeover,  i.e.  switch  of  production  from  product  A  to  product  B  with  the  minimum 
‘off-spec’  production. 

The  conclusions  from  this  work  by  McKinnel  and  Ponton  [16]  are  that  modestly-sized 
transputer-based  systems  are  now  sufficiently  powerful  and  cost-effective  to  be  considered 
for  dedicated  use  ‘on-line’  for  this  problem. 

5.3 Design: Optimisation in stressed membrane surface structures

An important Engineering application is being undertaken by Moncrieff and Topping at
Heriot  Watt  University  [17,18].  This  concerns  the  optimisation  of  cutting  patterns  for  tension 
structures  such  as  the  Schlumberger  Headquarters  in  Cambridge.  The  aim  is  to  improve 
current methods for optimising stress distributions across the surface. The conventional non-linear
modelling approach starts from some arbitrary surface and progressively relaxes this
to  an  equilibrium  configuration;  this  requires  specifying  a  desired  surface  stress  distribution 
and  an  appropriate  topology.  The  real  difficulty  is  that  one  must  define  a  cutting  pattern 
for flat cloths from which the real surface will be fabricated. One is therefore faced with
optimising  the  cutting  pattern  by  varying  the  planar  cloth  geometry.  This  is  just  one  example 
of  an  important  class  of  non-linear  optimisation  problems  in  which  new  design  points  are 
iteratively  determined  by  displacement  along  a  search  vector  by  a  specific  amount. 

In  practice  a  hierarchical  CAD  data  structure  is  used,  in  which  the  surface  patches  are 
defined  in  terms  of  discrete  space  curves.  These  curves  are  in  turn  defined  using  lists  of 
atomic  control  points,  which  are  the  variables  over  which  the  optimisation  is  performed.  The 
non-linearity  of  the  problem  requires  that  the  gradients  for  the  determination  of  the  search 
vector must be calculated numerically. Having determined a search vector, a linear search
must be performed along this direction to determine a good step length. The gradients of
the  objective  and  constraint  functions  with  respect  to  each  control  point  can  be  calculated 
independently,  as  can  the  trial  step  lengths,  so  that  the  whole  calculation  can  be  done 
efficiently  by  task  farming.  Since  in  real  structures  there  may  be  hundreds  of  variables,  and 
this  is  by  far  the  most  time-consuming  part  of  the  computation,  the  potential  for  parallel 
computation  is  vast. 
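
The farming of independent gradient and step-length evaluations can be sketched as follows. This is a minimal illustration and not the code of [17,18]: the quadratic `objective`, the central-difference step `h`, and the use of a Python thread pool in place of a farm of transputer workers are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def objective(x):
    """Hypothetical stand-in objective: sum of squares of the variables."""
    return sum(v * v for v in x)


def partial_derivative(f, x, i, h=1e-6):
    """Central-difference estimate of df/dx_i; needs no other partial derivative."""
    up = list(x)
    up[i] += h
    down = list(x)
    down[i] -= h
    return (f(up) - f(down)) / (2.0 * h)


def farmed_gradient(f, x, workers=4):
    """Farm one partial derivative per variable out to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: partial_derivative(f, x, i), range(len(x))))


def farmed_line_search(f, x, direction, steps, workers=4):
    """Evaluate the trial step lengths independently and keep the best one."""
    def value(t):
        return f([xi + t * di for xi, di in zip(x, direction)])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(value, steps))
    best = min(range(len(steps)), key=lambda k: values[k])
    return steps[best], values[best]
```

Each task touches only its own copy of the design vector, which is what makes the work suitable for farming: no communication is needed between workers, only between each worker and the master that collects the results.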

The method has been implemented on the ECS, and near-linear speed-up has been observed. The use
of the ECS has enabled interactive design to replace overnight batch runs on the Edinburgh
University mainframe.


6  Concluding  Remarks 

In  this  paper  we  have  presented  a  snapshot  in  the  development  of  a  large  transputer  array 
facility,  with  emphasis  on  a  number  of  applications  where  performance  is  a  crucial  factor 
in  the  feasibility  and  success  of  the  problem.  The  range  of  work  underlines  the  potential 
of  massively  parallel  machines  in  supercomputer  applications.  We  have  also  stressed  the 
importance  of  the  development  of  tools  and  environments  to  facilitate  ease  of  porting  and 
future  portability.  For  distributed  memory  machines,  we  are  still  at  the  beginning  of  this 
process,  but  already  it  is  clear  that  sufficient  progress  has  been  made  to  establish  the  scientific 
value  and  commercial  viability  of  this  technology.  In  drawing  an  overall  perspective,  one 
would  conclude  that  the  two  key  factors  for  its  continued  commercial  success  are  tools  for 
efficient  porting  of  large  C  and  Fortran  codes,  and  competitive  microprocessor  development, 
incorporating  commensurate  communications  and  routing  capabilities. 


Abstract

Capable  of  executing  multiple  scalar  operations  per  cycle,  a  superscalar  architecture  can  parallelize  not 
just  vectorizable  programs,  but  also  code  containing  recurrences  and  data  dependent  control  flow.  This 
paper presents an overview of the compiler optimizations that are crucial in harnessing the computation
power  of  superscalar  machines.  These  optimizations  include  high-level  loop  transformations  to  find 
parallelism  and  improve  the  efficiency  of  caches,  software  pipelining  and  hierarchical  reduction 
techniques  for  scheduling  instructions,  and  modulo  variable  expansion  for  assigning  registers. 

Recent  advances  in  hardware  technology  have  made  superscalar  architectures  amenable  to  single-chip 
implementations.  The  combination  of  cheap  hardware  to  provide  a  high  raw  computing  power  and 
sophisticated  compiler  technology  to  effectively  use  the  parallelism  can  produce  extremely  low-cost, 
high-performance  workstations  that  are  easily  accessible  to  the  general  scientific  and  engineering 
community. 

1.  Introduction 

A superscalar computer is a uniprocessor that can execute two or more scalar operations in parallel.
The  operations  are  individually  specified  in  the  object  code;  this  is  distinct  from  vector  machines  which 
expand  vector  instructions  into  series  of  parallel  operations.  The  parallelism  of  a  vector  instruction  is 
defined  for  each  vector  machine  at  machine  design  time;  on  a  superscalar  machine,  a  parallel  execution 
schedule  is  created  uniquely  for  each  program,  by  either  hardware  or  software.  As  a  result,  superscalar 
machine  organizations  are  more  versatile  and  effective  in  using  the  hardware  resources  in  the  system. 

Superscalar  machines  existed  long  before  the  term  was  coined.  The  IBM  Stretch  [5],  the  CDC 
6600  [24]  and  the  IBM  360/91  [2]  are  all  superscalar  architectures  that  can  execute  multiple  operations  in 
parallel.  These  machines  all  implement  a  sequential  instruction  set  with  hardware  that  schedules  the 
instructions  dynamically.  Besides  hardware,  software  has  also  been  used  for  instruction  scheduling. 
Epitomizing the class of superscalar machines that rely on software for scheduling instructions is the
VLIW (Very Long Instruction Word) architecture [13]. Each wide instruction word explicitly specifies
the operations to be executed in parallel. Examples of such machines include Multiflow's Trace
machines [8], Carnegie Mellon's processors for the Warp systolic array [3] and Cydrome's Cydra
5 [9]. Recent advances in hardware technology have made software-scheduled superscalar architectures
amenable to single-chip implementations. A follow-on of the Warp processor, Carnegie Mellon and
Intel's iWarp processor integrates high-performance computation and systolic communication in a single
component [6]. Intel's i860 is a single-chip microprocessor that can perform up to 100 million
floating-point operations per second (MFLOPS) using a dual-instruction word format [17].

The  development  of  the  recent  superscalar  architectures  presents  an  exciting  prospect  to  the  engineering 
and  scientific  community.  As  technology  improves,  the  superscalar  processor  performance  is  expected  to 
grow. The superscalar processor provides a more flexible form of instruction parallelism in a low-cost
package. The impact is that high computing power can be easily provided in a low-cost desktop
workstation that is widely accessible to engineers and scientists. The high level of integration also makes
these  scalar  processors  a  useful  building  block  for  large-scale  multiprocessing,  thus  delivering  an 
aggregate  computation  bandwidth  higher  than  ever  before. 

The  parallelism  of  a  superscalar  machine  may  be  managed  in  hardware  or  software.  The  hardware 
approach  schedules  the  instructions  dynamically,  thus  hiding  parallelism  from  the  architecture.  The 
instruction  set  architecture  can  therefore  be  made  compatible  with  that  of  an  existing  sequential  machine. 
Run-time  scheduling,  however,  requires  more  hardware  logic,  which  may  result  in  a  slower  clock  cycle  or 
longer  latency  in  instruction  execution.  In  the  software  approach,  the  parallelism  is  exposed  at  the 
architecture  level,  and  the  compiler  is  responsible  for  specifying  the  parallel  operations  to  execute.  By 
analyzing  the  entire  program  statically,  the  compiler  can  exploit  higher  level  program  semantics  and 
rearrange  the  code  globally  to  derive  a  better  schedule. 

To harness the raw computing speeds of software-scheduled superscalar processors in applications,
compiler  technology  is  crucial.  The  compiler  hides  the  parallelism  from  the  programmer,  so  the 
programmer  can  develop  applications  easily  using  a  high-level  sequential  language.  This  approach  has 
the  additional  advantage  that  the  same  sequential  programs  can  now  easily  be  ported  to  other  current  and 
future  machine  architectures. 

In  this  paper,  we  first  describe  the  characteristics  of  the  superscalar  architecture  and  the  issues  in 
compiling  code  for  such  machines.  We  then  present  a  set  of  compiler  optimizations,  showing  how  the 
functionality  of  the  processor  can  be  used  in  programs.  We  then  close  with  a  discussion  on  the 
performance  of  these  superscalar  machines. 

2.  Superscalar  Architectures 

Common  to  all  superscalar  processors  is  the  presence  of  parallel  and/or  pipelined  functional  units.  Like 
any  machine  that  employs  parallelism  and  pipelining,  a  program  running  on  a  superscalar  seldom 
achieves  the  peak  computation  rate  of  the  machine.  If  a  superscalar  processor  has  n  functional  units,  or  a 
functional unit with n pipeline stages, n independent operations must be present at all times to utilize the
machine fully. If no parallelism is found, the machine may operate at 1/nth of the peak rate. Therefore,
for  a  superscalar  to  be  effective,  it  is  important  that  the  scheduler  can  find  enough  independent  operations 
to  execute  in  parallel. 

Before  we  discuss  the  scheduling  techniques,  let  us  first  take  a  look  at  the  fundamental  limit  the 
hardware  imposes  on  the  execution  speed  of  a  program.  Even  if  there  are  enough  independent  operations, 
the  full  computation  power  of  a  superscalar  may  not  be  brought  to  bear  on  an  application  because  of 
specialization.  The  processor  typically  consists  of  a  set  of  specialized  functional  units,  some  memory 
access units, possibly different arithmetic units, and an instruction branch control unit. For example, a
program  that  requires  no  multiplications  will  not  be  able  to  take  advantage  of  the  multiplication  unit  on 
the  processor. 

The  hardware  of  a  system  is  typically  designed  such  that  the  distribution  of  the  computational  units 
matches  the  distribution  of  operations  in  a  typical  program.  From  the  statistics  of  a  large  set  of  numerical 
applications  [18],  we  have  observed  that  there  are  about  as  many  floating-point  arithmetic  operations  as 
memory  operations.  About  60%  of  the  memory  operations  are  read  operations,  and  about  70%  of  the 
floating-point  operations  are  additions.  On  a  machine  that  can  execute  one  memory  read,  one  memory 
write,  one  floating-point  addition,  and  one  floating-point  multiplication  in  a  single  cycle,  the  adder  is 
often the critical resource and is followed by the memory read unit.
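
The reasoning behind this ranking can be made explicit with a small calculation. The sketch below assumes, for illustration only, a normalised workload of one memory operation per floating-point operation and applies the fractions quoted above; the unit names are hypothetical labels, not names from any particular machine.

```python
def unit_demand(fp_ops=1.0, mem_ops=1.0, read_frac=0.6, add_frac=0.7):
    """Operations issued to each unit per normalised unit of work."""
    return {
        "mem_read": mem_ops * read_frac,
        "mem_write": mem_ops * (1.0 - read_frac),
        "fp_add": fp_ops * add_frac,
        "fp_mul": fp_ops * (1.0 - add_frac),
    }


def ranked_units(demand):
    """Units ordered from most to least heavily loaded (one unit of each kind)."""
    return sorted(demand, key=demand.get, reverse=True)


def utilization(demand):
    """Busy fraction of each unit when the most loaded unit is kept fully busy."""
    peak = max(demand.values())
    return {unit: load / peak for unit, load in demand.items()}
```

With these figures the adder carries a load of 0.7 and the read unit 0.6, so even a perfect schedule leaves the multiplier and the write unit idle for a large part of the time.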

Besides  the  utilization  of  the  functional  units  in  a  processor,  it  is  also  important  to  consider  the  memory 
subsystem.  To  support  a  high  computation  bandwidth,  a  processor  must  also  have  a  similarly  powerful 
memory  subsystem.  For  a  vector  machine,  the  more  restricted  mode  of  operation  permits  the  use  of 
vector  registers  and  efficient  block  transfers  between  the  memory  subsystem  and  the  registers.  Being  able 
to  support  a  less  regular  form  of  parallelism,  a  superscalar  architecture  requires  a  more  flexible  memory 
system.  The  concept  of  memory  hierarchy  has  been  shown  to  be  useful  in  reducing  the  average  access 
latencies for general-purpose machines. A cache can also reduce the number of memory accesses, which
can be important in a multiprocessor environment.

Unfortunately,  a  cache  sometimes  behaves  rather  poorly  for  numerical  code.  Because  of  the  large  data 
set  used,  data  brought  into  the  cache  may  be  flushed  out  before  they  are  reused.  The  cache  hit  rate  can 
fluctuate  widely  depending  on,  for  example,  whether  a  matrix  operand  is  in  the  cache.  This  may  greatly 
affect  the  overall  speed  obtained  due  to  the  large  difference  between  cache  and  memory  speeds.  While  a 
cache  is  normally  transparent  to  compilers  for  general-purpose  programs,  it  is  beneficial  to  optimize  the 
cache  behavior  in  superscalar  compilers. 

In many ways, a superscalar compiler faces issues similar to those of a vectorizing compiler. The
compiler  must  extract  parallelism  from  sequential  programs  and  try  to  use  the  parallel,  specialized 
functional  units  effectively.  The  compiler  must  also  manage  the  cache;  this  is  analogous  to  the 
management  of  vector  registers  in  vectorizing  compilers.  Though  the  issues  are  similar,  a  superscalar 
machine  presents  new  challenges  to  compiler  optimization.  The  parallelism  must  be  managed  at  the 
scalar  operation  level  and  the  parallelism  exploitable  is  not  regular  like  vector  instructions. 

3. Overview of Compiler Techniques

There  are  two  levels  of  compiler  optimization:  the  loop  level  and  the  instruction  level.  The  loop  level 
involves  higher  level  transformations  on  the  loop  structure.  These  transformations  are  useful  both  for 
bringing  parallelism  to  the  innermost  loop  as  well  as  improving  data  locality.  This  high-level 
restructuring  prepares  the  loop  for  low-level  instruction  scheduling. 

The  instruction  level  optimization  consists  of  instruction  scheduling  and  register  assignment 
techniques.  The  scheduling  problem  is  to  find  the  shortest  instruction  schedule  that  satisfies  the 
constraints  imposed  by  the  machine  resources  and  the  program  semantics.  In  particular,  since  most  of  the 
computation  time  is  spent  on  innermost  loops,  it  is  important  to  schedule  such  loops  efficiently.  Software 
pipelining  is  a  scheduling  technique  that  exploits  the  repetitive  nature  of  innermost  loops  to  generate 
highly efficient code for processors with parallel, pipelined functional units [19, 22, 25]. Another code
scheduling technique used with software pipelining is hierarchical reduction, a technique that abstracts
control constructs as operations in a basic block, so the same scheduling algorithms can be applied
both within and across basic blocks. For example, using hierarchical reduction, software pipelining can be
applied to all innermost loops, including those containing conditional statements. Hierarchical reduction
makes it possible to obtain a consistent performance improvement for many more programs. Interacting
with code scheduling is register assignment. When the same register is assigned to different variables,
their uses must be serialized, thus constraining the parallelism in the computation. Therefore, the register
assignment must also be considered hand-in-hand with instruction scheduling.

In  the  following,  we  first  present  an  overview  of  the  analysis  techniques  necessary  to  support  both  loop 
level  and  instruction  level  parallelism.  We  then  discuss  each  of  the  optimizations:  loop  level 
transformations, software pipelining, hierarchical reduction and register assignment.

Program  semantics  produces  two  kinds  of  constraints:  control  dependences  and  data  dependences.  A 
conditional branch instruction must first be executed to determine the instruction to execute next. This
sequencing constraint is known as control dependence. An operation cannot execute until all its operands
are produced. This sequencing constraint is known as true data dependence. To ensure that a read
operation always reads the latest value produced, the order of the write operations on the same location
must  also  be  observed.  This  sequencing  constraint  is  known  as  output  dependence.  Furthermore,  since  a 
data  location  may  hold  different  values  at  different  times,  a  value  must  not  be  overwritten  before  its  use. 
This  form  of  data  dependence  is  known  as  anti-dependence. 
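
The three kinds of data dependence can be told apart from the locations each operation reads and writes. The following is a minimal sketch under the assumption that every operation is summarised by two sets of location names; the function name `classify` and this representation are illustrative and are not taken from any compiler described here.

```python
def classify(first_reads, first_writes, second_reads, second_writes):
    """Data dependences from an earlier operation to a later one."""
    kinds = set()
    if set(first_writes) & set(second_reads):
        kinds.add("true")    # the later operation reads a value the earlier one produces
    if set(first_reads) & set(second_writes):
        kinds.add("anti")    # the later operation overwrites a value the earlier one reads
    if set(first_writes) & set(second_writes):
        kinds.add("output")  # both write the same location, so their order must be kept
    return kinds
```

For example, two successive executions of `a := a + 1.0` are related by all three kinds of dependence, because each both reads and writes `a`.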

The  compiler  must  first  extract  dependence  constraints  from  the  program.  The  analysis  algorithms  are 
similar to those previously used for vectorizing and concurrentizing compilers. The control dependence
can either be obtained through analysis of the flow graph [11], or simply retained from the syntactic
control structure of the program [16]. For data dependence, since array references are very common in
numerical code, it is important to determine if two array references can refer to the same location, and
thus may share a dependence relationship between them. Various dependence tests have been proposed
for disambiguating between array references whose indices are affine functions of loop
indices [1, 4, 27].
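
As an illustration of what such a test looks like, the sketch below implements the classical GCD test for two one-dimensional references X[a*i + b] and X[c*j + d]. It is offered as one well-known example and is not claimed to be the specific test of the cited works; it also ignores loop bounds, so a positive answer only means that a dependence cannot be ruled out.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """True if X[a*i + b] and X[c*j + d] may refer to the same element.

    The equation a*i + b = c*j + d has integer solutions (i, j) exactly
    when gcd(a, c) divides d - b.
    """
    g = gcd(a, c)
    if g == 0:            # both subscripts are constants
        return b == d
    return (d - b) % g == 0
```

For instance, `X[2*i]` and `X[2*j + 1]` can never overlap, since one reference touches only even elements and the other only odd ones, so the two references may be scheduled freely with respect to each other.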

The dependence information was used previously only for source-to-source loop transformations. For a
superscalar  machine,  this  information  is  used  at  both  the  loop  and  instruction  level.  In  the  compiler 
currently  developed  at  Stanford,  data  dependence  is  captured  in  an  intermediate  representation  that 
supports  loop  level  transformations,  and  this  same  information  can  be  used  in  the  code  generation  phase. 

4.  Loop  Level  Transformations 

High  level  code  transformations  are  useful  in  bringing  parallelism  into  the  innermost  loop,  as  well  as 
improving  the  efficiency  of  the  caches.  Consider  the  simple  example  of  a  matrix  multiplication: 

    FOR i := 0 TO n-1 DO
      FOR j := 0 TO n-1 DO
        FOR k := 0 TO n-1 DO
          C[i,j] := A[i,k]*B[k,j]+C[i,j];

The  result  of  one  addition  is  used  by  the  addition  in  the  next  iteration  of  the  loop.  The  additions  must 
therefore  execute  sequentially;  with  an  n-stage  pipelined  adder,  an  iteration  takes  at  least  n  clocks.  The 
multiplications,  being  independent,  can  execute  in  parallel  with  the  additions.  (Unlike  a  vector  machine, 
a  superscalar  machine  can  execute  some  instructions  in  parallel  even  for  recurrences.)  To  further  increase 
the  utilization  of  the  machine,  the  compiler  must  perform  higher  level  transformations  so  as  to  expose 
more parallelism in the innermost loop to the instruction scheduler. In this example, the inner two
loops can be interchanged as follows:

    FOR i := 0 TO n-1 DO
      FOR k := 0 TO n-1 DO
        FOR j := 0 TO n-1 DO
          C[i,j] := A[i,k]*B[k,j]+C[i,j];

The iterations in the innermost loop are now independent; as many iterations as necessary can execute in
parallel to fully utilize the hardware resources of the machine. Therefore, when the innermost loop does
not  contain  enough  parallel  operations  to  keep  the  hardware  resources  busy,  high  level  transformations, 
similar  to  those  used  in  vectorizing  and  parallelizing  compilers,  should  be  applied. 
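
The legality of the interchange can be checked on a small case. The sketch below is a direct transcription of the two loop orders into Python, assuming square matrices stored as lists of rows; the function names are illustrative.

```python
def matmul_ijk(A, B, n):
    """Original order: the innermost k loop is a recurrence on C[i][j]."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] = A[i][k] * B[k][j] + C[i][j]
    return C


def matmul_ikj(A, B, n):
    """Interchanged order: each innermost j iteration updates a different C[i][j]."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][j] = A[i][k] * B[k][j] + C[i][j]
    return C
```

Both versions accumulate the products for a given C[i][j] in the same order of k, so they produce identical results; only the position of the recurrence in the loop nest changes.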

For  superscalar  machines  with  caches,  high  level  transformations  can  also  be  used  to  improve  overall 
performance by reducing the cache miss rate. Consider a machine whose cache is relatively small in
comparison  with  the  matrix  size.  The  objective  of  the  optimization  is  to  minimize  memory  accesses  by 
reusing  data  in  the  cache  as  much  as  possible.  In  the  optimized  program  above,  the  innermost  loop 
accesses  rows  k  and  i  of  matrices  B  and  C,  respectively.  The  same  row  of  C  is  used  in  the  next  outer 
loop,  but  the  B  data  will  not  be  reused  until  the  next  iteration  in  the  outermost  loop.  If  the  data  size  is 
large  compared  to  the  cache,  even  the  C  data  may  not  be  in  the  cache,  let  alone  the  B  data.  Maximum 
reuse  is  obtained  if  we  can  block,  or  tile,  the  computation  as  follows: 

    FOR ii := 0 TO n-1 BY b DO
      FOR kk := 0 TO n-1 BY b DO
        FOR jj := 0 TO n-1 BY b DO
          FOR i := ii TO min(ii+b-1, n-1) DO
            FOR k := kk TO min(kk+b-1, n-1) DO
              FOR j := jj TO min(jj+b-1, n-1) DO
                C[i,j] := A[i,k]*B[k,j]+C[i,j];

Each  of  the  matrix  elements  brought  into  the  cache  is  reused  b  times  before  it  is  removed  from  the  cache. 
The  value  of  b  is  chosen  to  maximize  the  cache  utilization. 
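
A runnable transcription of the tiled loop nest is given below as a sketch. It assumes square matrices stored as lists of rows and uses Python's half-open ranges, so each inner bound is written as min(ii + b, n) rather than as an inclusive upper limit; the block size b need not divide n.

```python
def matmul_tiled(A, B, n, b):
    """Blocked matrix multiplication over b-by-b tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for kk in range(0, n, b):
            for jj in range(0, n, b):
                # Work on one tile of A, B and C at a time.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] = A[i][k] * B[k][j] + C[i][j]
    return C
```

Tiling reorders the iterations without changing which products contribute to each element of C, which is why the result can be compared directly against an untiled computation.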

Previous research on data locality has provided ways to predict the cache behavior of a loop nest.
Gannon et al. [14] use uniformly generated references to find where locality exists in a nesting of loops.
They also discuss how to choose which array elements should go into the cache for a given loop.
Porterfield [21] estimates cache behavior for a loop nest assuming that the cache uses the least recently
used replacement policy, and may block a loop if the cache cannot hold all the data in an iteration.
Gannon et al.'s and Porterfield's estimates can be used to evaluate the data locality of entire loop nests
obtained by different sets of transformations.
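
The effect such estimates are meant to predict can also be observed by brute force on a small case. The sketch below is not Gannon et al.'s or Porterfield's method; it simply replays the reference stream of a loop nest through a simulated cache. The fully associative organisation, the line size of one matrix element, and the function names are assumptions made for the example.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of a fully associative LRU cache holding `capacity` elements."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used element
    return misses


def trace_ikj(n):
    """Element references of the interchanged (i, k, j) matrix multiplication."""
    for i in range(n):
        for k in range(n):
            for j in range(n):
                yield ("A", i, k)
                yield ("B", k, j)
                yield ("C", i, j)


def trace_tiled(n, b):
    """Element references of the tiled matrix multiplication with block size b."""
    for ii in range(0, n, b):
        for kk in range(0, n, b):
            for jj in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        for j in range(jj, min(jj + b, n)):
                            yield ("A", i, k)
                            yield ("B", k, j)
                            yield ("C", i, j)
```

Comparing `lru_misses(trace_ikj(16), 64)` with `lru_misses(trace_tiled(16, 4), 64)` shows the tiled version missing less often, because the three 4-by-4 tiles in use at any time fit in the simulated cache while a whole matrix does not.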

Loop  transformations  beneficial  to  data  locality  and  parallelism  for  superscalar  machines  include  loop 
interchange, reversal, skewing and tiling. Wolf and I have developed an efficient algorithm that searches
through the space of these transformations and generates code that displays data locality and parallelism
in the innermost loops [26]. We reduce the optimization problem to placing the maximum number of
loops identified to carry locality in the innermost tile. Using this goal and the legality considerations of
tiling, we can significantly prune the search space to find the best set of transformations. How tiling
improves  data  locality  has  been  illustrated  by  the  example  above.  The  conditions  that  made  tiling  legal  in 
the  first  place  guarantee  both  coarse  and  fine  grain  parallelism  within  a  tiled  loop.  Therefore,  by  tiling  the 
loops,  we  generate  code  that  exhibits  both  data  locality  and  parallelism. 

5.  Software  Pipelining 

After  performing  the  high-level  transformations,  the  compiler  can  then  apply  the  instruction  level 
optimizations.  The  basic  technique  for  obtaining  parallelism  is  software  pipelining.  Let  us  introduce  the 
concept  of  software  pipelining  by  way  of  an  example.  Suppose  we  have  a  machine  that  can  perform  one 
load, one store, and initiate a 7-stage pipelined floating-point operation in one instruction, and suppose the code
we want to execute is:

    FOR i := 1 TO n DO
      A[i] := A[i]+1.0;

Assume for now that we can generate the addresses for the loads and stores in parallel with the rest of the
computation; the specifics of this topic will be discussed in Section 7. The most compact instruction
sequence to execute a single iteration of this loop is given in Figure 5-1. The operation BLoop 1
branches back to label 1 if there are more iterations to execute. The schedule is sparse due to the heavy
pipelining in the data path. (For machines with hardware interlocks, the nop instructions are used only at
code scheduling time; they are omitted when the code is emitted.) If we simply iterate this schedule, the
throughput of the loop is only 1 iteration every 9 clock ticks, and no resources are used more than 1/9th of
the time.

1:  LD
    FADD
    nop
    nop
    nop
    nop
    nop
    nop
    ST    BLoop 1

Figure  5-1:  Object  code  for  one  iteration  in  example  program 

Techniques such as trace scheduling [12] depend on loop unrolling to generate enough parallel
instructions to schedule. If the loop body of the example is unrolled 9 times, the optimal schedule
of the body of the unrolled loop is as shown in Figure 5-2. (This instruction sequence assumes that the
number of iterations is divisible by 9.) Each row in the figure corresponds to operations in an instruction,
and each column corresponds to the computation of one iteration of the loop in the source program.
Unrolling the loop 9 times improves the throughput to 9 iterations every 17 clocks. From the figure, it is
clear that unrolling an additional iteration will only lengthen the schedule by one clock. This can be kept
up until the iterations run out. A loop unrolled u times will have a throughput rate of u/(u+8) iterations
per clock, while the ideal throughput is 1 iteration per clock.

1:  LD
    FADD  LD
          FADD  LD
                FADD  LD
                      FADD  LD
                            FADD  LD
                                  FADD  LD
                                        FADD  LD
    ST                                        FADD  LD
          ST                                        FADD
                ST
                      ST
                            ST
                                  ST
                                        ST
                                              ST
                                                    ST    BLoop 1

Figure  5-2:  Optimal  schedule  for  nine  iterations 
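
The shape of the unrolled schedule and the resulting throughput can be reproduced with a few lines of code. The sketch below assumes the timing of the example machine (the add issues one clock after its load, and the store issues seven clocks after the add); the function names are illustrative.

```python
from fractions import Fraction


def unrolled_schedule(u):
    """Rows of the schedule for u unrolled iterations; each row lists (op, iteration)."""
    rows = [[] for _ in range(u + 8)]
    for it in range(u):
        rows[it].append(("LD", it))        # one load can start per clock
        rows[it + 1].append(("FADD", it))  # the add follows its load by one clock
        rows[it + 8].append(("ST", it))    # the store waits for the 7-stage add
    return rows


def throughput(u):
    """Iterations per clock for a loop unrolled u times, i.e. u / (u + 8)."""
    return Fraction(u, len(unrolled_schedule(u)))
```

No row ever holds two operations of the same kind, so the schedule respects the one-load, one-add, one-store limit of the machine while approaching, but never reaching, one iteration per clock.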

Although  the  schedule  improves  as  we  unroll  more  iterations,  code  expansion  limits  the  degree  of 
unrolling.  Unrolling  can  therefore  overlap  only  a  small  finite  number  of  iterations;  all  the  unrolled 
iterations  must  complete  before  the  program  branches  back  to  another  set  of  contiguous  iterations.  On  a 
vector  machine,  such  a  loop  maps  directly  into  a  vector  instruction;  a  vector  instruction  can  continually 
overlap  operations  from  successive  iterations  to  deliver  a  throughput  of  one  iteration  per  clock  cycle. 

Software  pipelining  can  achieve  the  same  kind  of  performance  obtained  with  vector  instructions  by 
continually  overlapping  operations  from  different  iterations  of  a  loop.  The  software  pipelined  program 
for the example above is shown in Figure 5-3. Code generated by software pipelining is compact. The
code in the figure assumes that there are at least nine iterations in the loop. The first eight instructions
constitute the prolog where more and more iterations of the loop start to execute. The steady state is
reached after eight instructions, and is repeated until all iterations have been initiated. In the steady state,
nine iterations are in progress at the same time, with one iteration starting up and one finishing off every
clock. On leaving the steady state, the iterations currently in progress are completed in the epilog, the
10th  through  17th  instructions.  This  program  achieves  the  optimal  computation  time  by  executing  n 
iterations  in  n+8  clock  ticks,  where  n  is  the  number  of  iterations  in  the  loop. 
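
The prolog, steady state and epilog described above can be expanded into the dynamic instruction stream and checked against the timing of a single iteration. The sketch below assumes the example machine (add issued one clock after the load, store issued seven clocks after the add) and represents each instruction as a tuple of operation names; this representation is an assumption of the example.

```python
PROLOG = [("LD",)] + [("FADD", "LD")] * 7     # eight instructions that fill the pipeline
KERNEL = ("ST", "FADD", "LD")                 # the steady state, repeated by BLoop
EPILOG = [("ST", "FADD")] + [("ST",)] * 7     # eight instructions that drain it


def pipelined_stream(n):
    """Instructions executed, in order, for a loop of n >= 9 iterations."""
    if n < 9:
        raise ValueError("the pipelined code assumes at least nine iterations")
    return PROLOG + [KERNEL] * (n - 8) + EPILOG


def issue_times(stream):
    """Clock at which the k-th LD, FADD and ST are issued."""
    times = {"LD": [], "FADD": [], "ST": []}
    for clock, instruction in enumerate(stream):
        for op in instruction:
            times[op].append(clock)
    return times
```

Since operations of each kind are issued in iteration order, the k-th entry of each list belongs to iteration k, and the stream has length n + 8 as stated in the text.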

        LD
        FADD  LD
        FADD  LD
        FADD  LD
        FADD  LD
        FADD  LD
        FADD  LD
        FADD  LD
1:  ST  FADD  LD  BLoop 1
    ST  FADD
    ST
    ST
    ST
    ST
    ST
    ST
    ST

Figure 5-3: Program of a software pipelined loop

Software pipelining is different from loop unrolling in that a source iteration may span one or more
iterations in the object code. If the machine contains pipelined functional units, the pipeline stages need
not be emptied at iteration boundaries. In the example above, seven additions initiated in seven different
iterations execute in parallel. The hardware pipelines are filled and drained only once on entering and
exiting the loop, respectively. Software pipelining is especially beneficial for machines with high degrees
of parallelism and specialization. The results are that optimal throughput can be achieved, and achieved
with an extremely compact program.

5.1.  The  Problem 

In this section, we first concentrate on the scheduling of loops containing a single basic block.
Extending  software  pipelining  to  other  loops  is  discussed  with  hierarchical  reduction  in  the  next  section. 
The  primary  goal  of  software  pipelining  is  to  maximize  the  throughput  in  executing  the  iterations;  it  does 
not  matter  if  the  execution  time  of  individual  iterations  is  lengthened.  Its  secondary  goal  is  to  keep  the 
code  size  down.  In  other  words,  the  schedule  must  have  a  short  steady  state  so  that  it  can  be  captured  in  a 
relatively  succinct  code  sequence.  The  problem  is  thus  formulated  as  finding  a  common  schedule  for  all 
iterations  of  the  source  loop,  such  that  successive  iterations  are  initiated  with  a  constant  interval,  and  the 
objective is to minimize this interval. In the example above, the schedule of an iteration is given in
Figure 5-1, and the iteration initiation interval is one.

Software  pipelining  was  originally  derived  from  a  technique  for  scheduling  hardware  pipelines,  where 
the  problem  was  formulated  as  inserting  delays  between  hardware  units  to  increase  the  overall  throughput 
of  the  system  [20].  New  input  is  accepted  by  the  hardware  pipeline  at  regular  periodic  intervals.  The 
software  analog  is  to  schedule  operations  within  an  iteration  such  that  the  iterations  can  be  pipelined  to 
yield optimal throughput.

Software  pipelining  has  been  used  in  compilers  for  several  different  architectures.  The  algorithm  was 
first used in the ESL polycyclic architecture [22]. The polycyclic machine uses a specialized crossbar to
simplify the scheduling problem for a subset of loops [23]. The same concept is also implemented in
Cydrome's Cydra 5 [9]. Software pipelining is also used in the compiler for the FPS-164 machine [23].
The FPS-164 does not have any specialized support for software pipelining, and software heuristics are
used to schedule the loops. We improved upon the FPS heuristics, especially in the algorithm for
scheduling recurrences, and implemented them in our compilers [7, 19] for Carnegie Mellon's Warp
and  iWarp  machines.  Eisenbeis  et  al.  applied  software  pipelining  to  the  problem  of  scheduling  vector 
instructions,  and  implemented  a  compiler  that  generates  software  pipelined  vector  code  for  the  Cray-2 
architecture  [10]. 


Let  us  first  describe  some  of  the  fundamental  limits  in  scheduling  a  loop.  There  are  two  kinds  of 
constraints:  resource  and  precedence  constraints. 

Resource  Constraints.  Suppose  a  machine  has  m(r)  units  of  resource  r,  and  an  iteration  of  a  loop 
requires  n(r)  units  of  resource  r,  then  a  pipelined  loop  cannot  execute  faster  than  the  rate  of  at  most  one 
iteration  every 

maxr 

cycles.  This  equation  reconfirms  the  notion  that  it  is  harder  to  fully  utilize  highly  specialized  functional 
units  and  the  computation  rate  is  limited  by  the  resource  with  the  highest  demand. 

In  software  pipelining,  we  must  ensure  that  the  resource  commitment  in  each  clock  cycle  of  the  steady 
state  does  not  exceed  the  available  resources.  The  resource  usage  of  the  steady  state  can  be  represented 
by  a  modulo  resource  reservation  table  whose  ith  entry  contains  the  sum  of  the  resources  used  in  cycles 
i,  i+s,  i+2s, ...  of  the  schedule  of  an  iteration,  where  s  is  the  initiation  interval  of  the  loop. 
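
As an illustration of the modulo resource reservation table, the sketch below folds the schedule of one iteration into s entries and checks the result against the machine. The representation (a list of (cycle, resource) pairs, one unit of resource per operation) is an assumption made for brevity.

```python
def modulo_reservation_table(schedule, s):
    """Fold one iteration's schedule into a table with s entries.

    schedule -- list of (cycle, resource) pairs for a single iteration
    s        -- initiation interval
    Entry i counts the resources used in cycles i, i+s, i+2s, ...
    """
    table = [dict() for _ in range(s)]
    for cycle, resource in schedule:
        slot = table[cycle % s]
        slot[resource] = slot.get(resource, 0) + 1
    return table


def fits(table, available):
    """True if no entry of the table exceeds the available resources."""
    return all(count <= available[r]
               for slot in table
               for r, count in slot.items())


# Hypothetical iteration: load at cycle 0, add at cycle 1, store at cycle 3.
iteration = [(0, "mem"), (1, "fadd"), (3, "mem")]
machine = {"mem": 1, "fadd": 1}

print(fits(modulo_reservation_table(iteration, 2), machine))
print(fits(modulo_reservation_table(iteration, 3), machine))
```

With a single memory port this schedule would fit at s = 2 but not at s = 3, because cycles 0 and 3 fall into the same table entry. A larger interval does not automatically make a fixed schedule legal.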

Precedence constraints. While recurrences limit the throughput of the computation, a superscalar,
unlike a vector machine, can often still find some parallelism in such loops. Consider the following
example:

FOR i := 1 TO 100 DO
    a := a + 1.0;

We must first read a before we write back into a in the same iteration, which in turn must precede the
read operation in the next iteration. The flow graph representing the above example, assuming a
seven-staged addition, is shown in Figure 5-4. Each edge is labeled by the number of iterations the
dependence crosses and the delay between them. As shown in the figure, inter-iteration data dependences
may introduce cycles into the precedence constraint graph. The precedence constraints in Figure 5-4
impose a delay of 9 clock ticks between load operations from consecutive iterations. That is, the loop
cannot execute at a rate greater than one iteration every 9 clocks.




We define the minimum delay, d, and minimum iteration difference, p, of a path to be the sum of the
minimum delays and minimum iteration differences of the edges in the path, respectively. If we let c
denote a cycle in the graph, the rate at which the iterations can be executed is at most one iteration every

$$\max_c \left\lceil \frac{d(c)}{p(c)} \right\rceil$$

cycles.
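
The precedence bound can be sketched in the same way. In the code below each cycle of the graph is given explicitly as a list of (delay, iteration difference) edges; enumerating the cycles of a real dependence graph is left out. The way the 9-tick delay of the running example is split over three edges is an assumption, since only the total is stated in the text.

```python
def precedence_bound(cycles):
    """Lower bound on the initiation interval imposed by recurrences.

    cycles -- list of cycles; each cycle is a list of
              (minimum delay, minimum iteration difference) edges
    """
    bound = 0
    for edges in cycles:
        d = sum(delay for delay, _ in edges)
        p = sum(diff for _, diff in edges)
        bound = max(bound, -(-d // p))  # integer ceiling of d / p
    return bound


# Assumed edge labels for "a := a + 1.0" with a seven-staged adder:
# read -> add (1, 0), add -> write (7, 0), write -> next read (1, 1).
recurrence = [(1, 0), (7, 0), (1, 1)]

print(precedence_bound([recurrence]))
```

A cycle that spans two iterations with the same total delay would give a bound of five cycles instead of nine, which is why both d(c) and p(c) enter the formula.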


The maximum of the two bounds determined by resource and precedence considerations establishes a
lower bound on the initiation interval. Therefore, a schedule that pipelines with an initiation interval
meeting the lower bound is optimal. Empirical results show that this lower bound can indeed be met in
many cases [18].

Figure 5-4: (a) Delays between operations from two iterations, and (b) precedence graph


5.2.  The  Algorithm 

The problem of finding the optimal software pipeline schedule is NP-complete. For acyclic graphs, the
scheduling  problem  is  tractable  if  operations  execute  in  unit  time  and  use  only  one  resource.  The 
polycyclic  architecture  [22]  and  the  Cydra  5  architecture  [9]  use  a  specialized,  rather  expensive  crossbar 
to  provide  exactly  that  property.  All  functional  units  of  a  polycyclic  machine  are  interconnected  through 
a  crossbar.  This  crossbar  has  storage  at  every  crosspoint  to  serve  as  a  dedicated  buffer  for  each  pair  of 
functional  units.  Therefore,  there  is  never  any  contention  in  reading  or  writing  data.  Each  operation  thus 
consumes  only  one  explicitly  scheduled  resource.  For  acyclic  graphs,  the  minimum  initiation  interval  is 
given  by  the  bound  discussed  above  and  an  optimal  schedule  can  easily  be  found.  However,  the  problem 
remains NP-complete for cyclic graphs even if operations use one unit of resource and execute in one unit
time. 

Without  the  specialized  hardware  to  support  software  pipelining,  both  the  FPS  and  the  Warp/iWarp 
compilers  use  software  heuristics.  The  algorithms  used  for  scheduling  acyclic  graphs  are  similar,  but  the 
cyclic  graph  scheduling  algorithm  is  significantly  improved  in  our  Warp/iWarp  compilers.  The  algorithm 
for  acyclic  graphs  is  as  follows:  First,  establish  a  lower  and  an  upper  bound  on  the  initiation  interval. 
The  lower  bound  is  calculated  from  the  resource  and  precedence  constraints;  the  upper  bound  can  be 
found  by  the  schedule  of  a  single  loop  iteration.  Next,  find  the  smallest  initiation  interval.  Simple  linear 
search  is  used  in  our  Warp/iWarp  compiler  because  empirical  results  show  that  a  schedule  meeting  the 
lower  bound  can  often  be  found.  The  algorithm  first  sets  the  target  of  the  initiation  interval  to  be  the 
lower  bound  value,  and  attempts  to  find  a  pipelinable  schedule  for  the  target  initiation  interval  using  the 
method  described  below.  If  the  attempt  fails,  this  process  is  reiterated  by  increasing  the  target  initiation 
interval  by  one  clock  tick  at  a  time. 

The basic algorithm used to find a software pipelinable schedule for a target initiation interval is list
scheduling. In list scheduling, the precedence constraints are applied first to determine the earliest slot
in which an operation can be scheduled. The scheduler then goes on to try to satisfy the resource
constraints; the modulo resource reservation table defined above is used to determine if there is a
resource conflict. The
scheduler  tries  to  schedule  the  operation  in  successive  time  slots  until  one  that  can  accommodate  its 
resource  requirement  is  found.  If  s  is  the  target  initiation  interval,  and  s  attempts  to  satisfy  the  resource 
constraints  fail,  by  the  definition  of  modulo  resource  usage,  this  operation  cannot  fit  into  the  schedule 
built  so  far.  When  this  happens,  the  attempt  to  find  a  schedule  for  the  given  initiation  interval  is  aborted 
and  the  scheduling  process  is  repeated  with  a  greater  interval  value. 
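
The acyclic procedure described above can be sketched as follows. This is a simplified illustration, not the Warp/iWarp implementation: every operation is assumed to occupy one unit of one resource for a single cycle, operations are assumed to be listed in topological order, and all names are hypothetical.

```python
def try_interval(ops, preds, available, s):
    """List-schedule one iteration for initiation interval s, or return None.

    ops   -- dict: name -> (resource, latency), in topological order
    preds -- dict: name -> list of predecessors in the same iteration
    """
    table = [dict() for _ in range(s)]  # modulo resource reservation table
    start = {}
    for name, (resource, _latency) in ops.items():
        # Precedence constraints give the earliest legal slot.
        earliest = max((start[p] + ops[p][1] for p in preds.get(name, [])),
                       default=0)
        # Try s successive slots; after s failures the operation cannot fit.
        for t in range(earliest, earliest + s):
            slot = table[t % s]
            if slot.get(resource, 0) < available[resource]:
                slot[resource] = slot.get(resource, 0) + 1
                start[name] = t
                break
        else:
            return None
    return start


def schedule_acyclic(ops, preds, available, lower, upper):
    """Linear search for the smallest workable initiation interval."""
    for s in range(lower, upper + 1):
        start = try_interval(ops, preds, available, s)
        if start is not None:
            return s, start
    return None


ops = {"ld": ("mem", 1), "add": ("fadd", 2), "st": ("mem", 1)}
preds = {"add": ["ld"], "st": ["add"]}
machine = {"mem": 1, "fadd": 1}

print(schedule_acyclic(ops, preds, machine, lower=1, upper=4))
```

Starting the search at 1 rather than at the true lower bound of 2 shows the retry step: the attempt at s = 1 is abandoned when the store finds the only memory slot taken, and the attempt at s = 2 succeeds.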

As  in  the  case  of  acyclic  graphs,  the  main  scheduling  step  for  cyclic  graphs  is  iterative.  For  each  target 
initiation interval, the strongly connected components are first scheduled individually. The original graph
is  then  reduced  by  representing  each  strongly  connected  component  as  a  single  vertex:  the  resource  usage 
of  the  vertex  represents  the  aggregate  resource  usage  of  its  components,  and  edges  connecting  nodes  from 
different  connected  components  are  represented  by  edges  between  the  corresponding  vertices.  This 
reduced  graph  is  acyclic,  and  the  acyclic  graph  scheduling  algorithm  can  then  be  applied. 
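
The graph reduction step can be illustrated with a small sketch. It finds the strongly connected components with Kosaraju's two-pass method (a choice made here for brevity; the source does not say which method the compiler uses) and collapses each component into one vertex. Resource usage is reduced to a single number per node, which is a simplification.

```python
def strongly_connected_components(graph):
    """graph: dict node -> list of successors. Returns dict node -> label."""
    order, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in graph.get(v, []):
            if w not in seen:
                visit(w)
        order.append(v)  # record nodes in order of completion

    for v in graph:
        if v not in seen:
            visit(v)

    reverse = {v: [] for v in graph}
    for v, succs in graph.items():
        for w in succs:
            reverse.setdefault(w, []).append(v)

    comp = {}

    def assign(v, label):
        comp[v] = label
        for w in reverse.get(v, []):
            if w not in comp:
                assign(w, label)

    for v in reversed(order):
        if v not in comp:
            assign(v, v)
    return comp


def condense(graph, usage):
    """Collapse each component into one vertex with aggregate usage."""
    comp = strongly_connected_components(graph)
    total = {}
    for v, c in comp.items():
        total[c] = total.get(c, 0) + usage[v]
    edges = {(comp[v], comp[w])
             for v in graph for w in graph[v] if comp[v] != comp[w]}
    return total, edges


# Hypothetical dependence graph: b and c form a recurrence.
graph = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": []}
usage = {"a": 1, "b": 2, "c": 1, "d": 1}
print(condense(graph, usage))
```

The reduced graph has three vertices and no cycles, so the acyclic scheduler can treat the recurrence as one node with the combined resource usage of b and c.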

Two  main  concepts  are  used  in  the  algorithm  for  scheduling  the  strongly  connected  components.  First, 
the precedence constraints are preprocessed so that the scheduler can easily determine the legal time span
in  which  any  node  can  be  scheduled.  Second,  the  order  in  which  the  instructions  are  scheduled  is 
designed  such  that  when  the  target  initiation  interval  value  is  increased,  the  chance  for  success  also 
improves.  This  is  important  because  it  would  be  futile  if  the  scheduling  algorithm  simply  retried  the  same 
schedule  that  failed. 

A large set of evaluation data on the Warp/iWarp machine indicates that provably optimal schedules
can often be found [18]. This shows that software pipelining does not require expensive hardware
support. The code generated is compact; the body of a software pipelined loop is even shorter than the
unoptimized code.

6.  Hierarchical  Reduction 

The  hierarchical  reduction  technique  is  designed  to  make  software  pipelining  applicable  to  all 
innermost  loops,  including  those  containing  conditional  statements.  The  proposed  approach  schedules  the 
program  hierarchically,  starting  with  the  innermost  control  constructs.  As  each  construct  is  scheduled,  the 
entire  construct  is  reduced  to  a  simple  node  representing  all  the  scheduling  constraints  of  its  components 
with  other  constructs.  This  node  can  then  be  scheduled  just  like  a  simple  node  within  the  surrounding 
control construct. The scheduling process is complete when an entire program is reduced to a single
node. 

The  use  of  the  construct  structure  exploits  high-level  control  dependence  knowledge  [11]  to  increase 
the  opportunity  for  code  motion.  As  an  example  of  the  kind  of  code  motions  achievable  with  this 
technique,  consider  the  following  program: 

FOR i := 0 TO n DO
BEGIN
    statement 1;
    IF c THEN statement 2 ELSE statement 3;
    statement 4;
END

Although statement 4 comes after the conditional statement, it is not control dependent upon the result of
the condition c. Once the program decides to execute another iteration, it can execute statements 1 and 4
in any order that satisfies the data dependences. For example, an operation in statement 4 can be executed
before the conditional statement. The hierarchical reduction algorithm first schedules the THEN and
ELSE parts of the conditional statement, and represents the entire construct with a single node that
inherits the union of the scheduling constraints for each of the two parts of the conditional statement. The
entire construct is then scheduled with statements 1 and 4. Operations corresponding to statements 1 and
4 may be reordered; they may also execute in parallel with the THEN and ELSE components of the
conditional statement. At code emission time, any code scheduled in parallel with the conditional
statement is duplicated in both the THEN and ELSE parts.
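
One plausible reading of "the union of the scheduling constraints" is that the reduced node claims, in each cycle, the larger of the resource demands of the two branches. The sketch below illustrates that reading only; the data layout and the function name are assumptions, and precedence constraints are not modeled.

```python
def reduce_conditional(then_usage, else_usage):
    """Combine the schedules of the THEN and ELSE parts into one node.

    Each argument is a list with one dict per cycle, mapping a resource
    name to the number of units used in that cycle.
    """
    length = max(len(then_usage), len(else_usage))
    node = []
    for t in range(length):
        a = then_usage[t] if t < len(then_usage) else {}
        b = else_usage[t] if t < len(else_usage) else {}
        # Only one branch executes, so the node needs the maximum, not the sum.
        node.append({r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)})
    return node


then_part = [{"mem": 1}, {"fadd": 1}]  # hypothetical two-cycle THEN branch
else_part = [{"fmul": 1}]              # hypothetical one-cycle ELSE branch
print(reduce_conditional(then_part, else_part))
```

Any resource left free by the combined node in a given cycle is free whichever branch is taken, so an operation from statement 1 or 4 can safely be placed there and duplicated in both branches.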

This  control  dependence  knowledge  when  combined  with  software  pipelining  can  produce  surprisingly 
efficient  code.  The  loop  termination  test  for  the  next  iteration  can  be  performed  immediately  after  the 
decision  to  execute  the  current  iteration.  This  test  can  move  past  all  the  conditional  branches  in  the  body 
of  the  loop.  In  this  way,  hierarchical  reduction  exposes  many  more  parallel  operations  for  scheduling. 

Hierarchical reduction also minimizes the penalty of short vectors, or loops with a small number of
iterations.  The  prolog  and  epilog  of  a  loop  can  be  overlapped  with  scalar  operations  outside  the  loop;  the 
epilog  of  a  loop  can  be  overlapped  with  the  prolog  of  the  next  loop;  lastly,  software  pipelining  can  be 
applied  even  to  an  outer  loop.  In  summary,  hierarchical  reduction  makes  it  possible  to  exploit  parallelism 
in  a  much  larger  set  of  applications.  It  allows  loops  containing  conditional  statements  to  be  software 
pipelined,  and  it  finds  parallelism  within  loop  bodies  that  are  too  long  to  pipeline. 

7.  Modulo  Variable  Expansion 

If  traditional  register  assignment  were  performed  before  code  scheduling,  then  the  reuse  of  registers  for 
different  variables  would  significantly  reduce  the  potential  parallelism  in  the  code.  This  is  because  the 
objective  of  register  assignment  is  to  use  as  few  registers  as  possible.  A  register  is  recycled  in  the  shortest 
amount  of  time,  thus  creating  many  more  data  dependences  that  need  to  be  observed.  Cooperation  is 
therefore  required  between  code  scheduling  and  register  assignment  in  a  superscalar  compiler.  Proposed 
strategies include combining register assignment with scheduling [15], and postponing register
assignment  until  after  scheduling  [18].  The  latter  approach  simplifies  the  compiler  design  by  separating 
scheduling  and  register  assignment  into  two  different  phases.  The  drawback,  however,  is  that  there  may 
not  be  enough  registers  and  code  needs  to  be  inserted  to  spill  values  to  memory. 

There  is  one  form  of  register  reuse  that  can  greatly  inhibit  parallelization,  and  that  is  the  use  of  the  same 
register  for  the  same  variable  in  different  iterations  of  a  loop.  To  illustrate  this  point,  let  us  use  the  same 
example: 

FOR i := 0 TO n DO
    A[i] := A[i] + 1.0;

For the sake of simplicity, here we assume that a floating-point addition takes only two clocks. The
object code for one iteration, complete with register assignment, is as follows.

* R1 preloaded with address of A
* FR7 preloaded with 1.0

    LD    FR1,(R1)
    FADD  FR1,FR7
    nop
    ST    FR1,(R1)
    ADD   R1,R1,4

The register assignment prevents this vectorizable loop from executing in parallel. The register FR1
cannot be loaded with the next input until after its last use in the previous iteration. Similarly, the register
R1 cannot be incremented until the last store operation is performed. Anti-dependences force the write
operations to follow all the read operations of the old values; consequently, the computation must execute
serially.

Modulo  variable  expansion  is  a  register  assignment  technique  that  eliminates  these  anti-dependences. 
The  following  is  the  result  of  applying  the  combination  of  software  pipelining  and  modulo  variable 
expansion  to  the  example  above. 

* R1 preloaded with address of A
* FR7 preloaded with 1.0

    LD    FR1,(R1)
    FADD  FR1,FR7    ADD   R2,R1,4
1:  LD    FR2,(R2)
    ST    FR1,(R1)   FADD  FR2,FR7    ADD   R1,R2,4
    LD    FR1,(R1)
    ST    FR2,(R2)   FADD  FR1,FR7    ADD   R2,R1,4    Bloop 1
    nop
    ST    FR1,(R1)

To  eliminate  the  anti-dependence  constraint,  the  second  iteration  uses  a  different  set  of  registers,  R2 
and FR2, and can thus overlap with the first. The third iteration, on the other hand, can reuse the set in
the  first  iteration.  In  fact,  every  other  iteration  can  use  the  same  set  of  registers,  making  the  code  identical 
every  two  consecutive  iterations.  The  length  of  the  steady  state  is  just  twice  the  initiation  interval  and  the 
loop  body  is  therefore  still  very  compact. 

We  call  this  optimization  of  assigning  several  registers  to  a  loop  variable  modulo  variable  expansion. 
In  vectorizing  compilers,  scalar  variables  are  expanded  into  arrays  so  that  each  iteration  refers  to  a 
different  array  element,  making  the  loop  vectorizable.  Modulo  variable  expansion  takes  advantage  of  the 
flexibility  of  superscalar  machines,  and  reduces  the  number  of  registers  allocated  to  a  variable  by  reusing 
the  same  location  in  non-overlapping  iterations. 
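
The register renaming behind modulo variable expansion can be sketched as follows. The rule used for the number of copies (the variable's lifetime divided by the initiation interval, rounded up) is an assumption consistent with the example above, where FR1 is written at cycle 0, last read at cycle 3, and the interval is 2; it is not quoted from the source.

```python
from math import ceil


def copies_needed(lifetime, s):
    """Registers needed so that overlapping iterations never share one.

    lifetime -- cycles from the write of the variable to its last read
    s        -- initiation interval
    """
    return max(1, ceil(lifetime / s))


def register_for(base, iteration, copies):
    """Name of the register used by a given iteration (numbered from 0)."""
    return f"{base}{iteration % copies + 1}"


copies = copies_needed(lifetime=3, s=2)
print(copies)
print([register_for("FR", k, copies) for k in range(4)])
```

Under these assumptions two copies suffice, and iterations alternate between FR1 and FR2, matching the unrolled steady state shown earlier. A variable whose lifetime does not exceed the initiation interval would need no expansion at all.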

A tradeoff can be made between the degree of loop unrolling and the number of registers used. For the
Warp machine, which contains a relatively large number of registers, minimizing the degree of unrolling is
a better choice [19]. Eisenbeis et al., on the other hand, minimize register usage because their target
machine, the Cray-2, has only eight vector registers [10].

8. Performance of Superscalar Machines

Having  functional  units  that  can  be  explicitly  controlled  by  software,  a  superscalar  processor  is  more 
versatile  than  a  vector  machine.  The  parallelism  on  a  vector  machine  is  restricted  to  the  set  of  vector 
instructions,  and,  if  chaining  is  supported,  parallelism  between  vector  instructions  that  use  different 
functional  units.  Using  software  pipelining  to  schedule  a  superscalar  with  similar  functional  units,  a 
simple  loop  that  corresponds  to  a  vector  instruction,  such  as  the  pairwise  additions  of  two  vectors,  can 
execute at the same throughput rate as a vector instruction. In addition, a superscalar can find parallelism
in complex loops. Loops do not need to be decomposed into simple vector instructions that require
partial expressions to be buffered in vector registers. More importantly, a superscalar can find parallelism
in loops with recurrences and conditional statements.

The  ability  of  a  superscalar  machine  to  execute  custom  generated  parallel  code  eliminates  the  need  for 
buffering  vectors  of  partial  results.  For  example,  a  vectorizing  compiler  must  decompose  the  loop  in 
Figure 8-1(a) into two, each corresponding to a vector-add instruction (Figure 8-1(b)). The partial sums
must be buffered in a vector register. On a superscalar machine, the partial results can be operated on as
soon as they are generated, as illustrated in Figure 8-1(c). This reduces the number of registers needed
and  possibly  memory  accesses. 

(a) FOR i := 0 TO n DO BEGIN
        c[i] := a[i]+b[i]+c[i];
    END;

(b) FOR i := 0 TO n DO BEGIN
        t[i] := a[i]+b[i];
    END;
    FOR i := 0 TO n DO BEGIN
        c[i] := t[i]+c[i];
    END;

(c) FOR i := 0 TO n DO BEGIN
        t := a[i]+b[i];
        c[i] := t+c[i];
    END;

Figure  8-1:  Reduced  register  requirement  in  superscalar  machines 
(a)  source  program,  (b)  vector  code,  (c)  scalar  code. 
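
The difference between forms (b) and (c) of Figure 8-1 can be mimicked in a few lines. This sketch only illustrates the storage requirement; it says nothing about timing, and the function names are hypothetical.

```python
def vector_style(a, b, c):
    """Form (b): two passes, with the partial sums buffered in a vector t."""
    t = [x + y for x, y in zip(a, b)]      # first vector add
    return [x + y for x, y in zip(t, c)]   # second vector add


def scalar_style(a, b, c):
    """Form (c): one pass, with the partial sum held in a scalar t."""
    result = []
    for x, y, z in zip(a, b, c):
        t = x + y              # consumed as soon as it is produced
        result.append(t + z)
    return result


a, b, c = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [0.5, 0.5, 0.5]
print(vector_style(a, b, c) == scalar_style(a, b, c))
```

Both forms compute the same values; the first needs storage for a whole vector of partial sums, while the second needs a single temporary per iteration in flight.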

A  recurrence  does  not  necessarily  mean  serial  execution  for  superscalar  machines.  As  long  as  there  are 
other  operations  that  can  execute  in  parallel  with  the  recurrence  computation,  a  high  computation  rate  can 
still  be  obtained  using  software  pipelining.  The  degree  of  parallelism  in  a  vectorized  loop  is  of  the  order 
of  the  number  of  iterations  in  the  loop.  A  recurrence,  however,  limits  the  degree  of  parallelism  by  the 
ratio  of  independent  operations  to  the  length  of  the  cyclic  dependence.  This  limited  form  of  parallelism 
can  be  exploited  in  superscalar  processors  because  of  their  unique  zero  synchronization  overhead.  The 
compiler  strategy  for  superscalar  machines  is  different  from  that  for  vector  machines.  A  vectorizing 
compiler  tries  to  decompose  a  loop  into  smaller  loops,  separating  recurrences  from  vectorizable  loops.  A 
superscalar  compiler,  on  the  other  hand,  tries  to  jam  independent  loops  together.  The  vectorizable  loop 
may  be  executed  on  the  idle  functional  units  while  the  program  computes  a  recurrence! 

In addition to recurrences, hierarchical reduction allows us to find parallelism even in loops with
conditional statements. Hierarchical reduction also reduces the penalty typically associated with short
vectors.  In  a  superscalar  machine,  the  scalar  operations  can  be  overlapped  with  the  prolog  and  epilog  of  a 
software  pipelined  loop.  This  easy  integration  of  scalar  and  vector  operations  makes  the  performance  of 
the  system  less  sensitive  to  the  size  of  the  data.  Moreover,  software  pipelining  can  be  applied  even  to 
outer  loops,  making  the  advantages  of  software  pipelining  applicable  even  for  programs  containing  short 
innermost  loops. 

The  instruction  scheduling  and  register  assignment  techniques  have  been  implemented  in  the  compilers 
for the Warp and iWarp machines, and have been extensively evaluated [18]. The Warp processor has a
peak  computation  rate  of  10  MFLOPS,  an  impressive  performance  for  a  machine  built  in  1986.  This  peak 
computation  rate  is  achieved  by  a  high  degree  of  parallelism  and  specialization.  In  a  single  instruction,  a 
Warp  processor  can  perform  one  7-staged  floating-point  addition,  one  7-staged  floating-point 
multiplication,  one  memory  read,  one  memory  write,  two  integer  operations,  and  one  branch  operation. 

We have analyzed the performance of a set of seventy-two programs and the Livermore kernels. The
performance of most of the programs falls in the 1 to 4 MFLOPS range, with a 2.8 MFLOPS
average. This utilization of resources is higher than that typically observed in supercomputers.
Performance analysis of the software pipeliner shows that the scheduler is successful in exploiting
parallelism once the parallelism is detected. About three-quarters of over one hundred loops pipelined are
provably optimal. When compared with code generated by a compiler that finds parallelism only within a
basic block, most of the loops achieve a speedup of between two and six.

9.  Conclusions 

This  paper  presents  an  overview  of  compiler  optimizations  that  exploit  parallelism  in  a  superscalar 
machine.  High-level  loop  transformations  improve  data  locality  and  place  parallelism  in  the  innermost 
loops,  in  preparation  for  instruction  level  optimizations.  Software  pipelining  is  the  basic  technique  that 
finds parallelism across iterations in inner loops. Hierarchical reduction helps deliver a high level of
performance  for  a  broader  range  of  applications,  for  example,  by  permitting  software  pipelining  to  be 
used  even  for  loops  with  conditional  statements.  And  lastly,  modulo  variable  expansion  eliminates 
dependence  constraints  due  to  reuse  of  registers  between  iterations. 

The  superscalar  architecture  is  a  promising  alternative  to  vector  machines.  We  now  have  compiler 
techniques  that  can  generate  highly  efficient  parallel  code  directly  from  user  programs.  Given  the  same 
hardware functional units, a superscalar machine delivers the same performance as a vector machine if the
program is vectorizable. And the superscalar machine is decidedly superior to vector machines when the
computation contains recurrences and conditional statements. A superscalar does not exhibit a dichotomy
in performance depending on whether the code is vectorizable or not.

Compiler optimizations require programs to be analyzable statically. A superscalar architecture has an
organization that is more easily enhanced to handle programs that are not amenable to static analysis. By
a judicious use of hardware to provide dynamic information to cooperating software, processors that
deliver consistently high performance through instruction level parallelism are possible.

Abstract

Parallel computer architectures and hardware have evolved impressively in the last few years.
Progress on the software side can best be characterized as moderate. The lack of widely accepted
methodologies and software to support parallel programming is profound even on the most advanced
parallel machines. Parallel programming is a complex task, and the performance of a parallel program
can be influenced by many different factors, such as coding of parallel constructs and/or restructuring,
scheduling schemes and scheduling overhead, synchronization and/or communication cost, program and
data partitioning, and memory allocation. In this paper we discuss the major aspects of parallel
programming. Parallel programming environments are considered in three fundamental phases:
parallelism specification, parallelism exploitation, and supporting environments and tools. A parallel
programming environment built at the University of Illinois is discussed as a case study.
Finally, we address the influence of parallel programming on multiprocessor operating systems,
and discuss future research directions.

1. Introduction

High-speed computers, often referred to as supercomputers, have invaded such fields as animation and
advertising, graphics and industrial computer-aided design, and other business applications, let alone more
traditional fields such as weather forecasting, seismic modeling and oil exploration, fluid dynamics, nuclear
physics, and other numeric-intensive applications in science and engineering. With more computational
power available, scientists can study problems which previously were impossible to model, or increase the
size, and thus the accuracy, of models for scientific and engineering problems which are already in use.

Traditionally, increased computing power has been achieved through more advanced technologies
that allowed higher degrees of integration and switching speeds, more efficient packaging and cooling, and
so on. Even though there is still room for improvement in switching speeds and integration through the
evolution of new technologies, the improvement strides will probably be less significant than in the past,
and will soon be limited by physical barriers. Nevertheless, the demand for more powerful computer
systems keeps increasing, and even today's most powerful supercomputers are unable to solve certain
problems in a reasonable time period. This limitation prompted researchers to investigate architectural
approaches to increased computing speeds through new designs and enhancements at the component level.
Some of these architectural solutions were more or less radical departures from the conventional
"von Neumann" architecture. An obvious approach, and one which has gained much ground, is increasing
performance through the replication of conventional processing elements, which work in a coordinated
manner and communicate with each other through some type of interconnection network.

These computer systems, well known as parallel processors or multiprocessor machines, are the
subject of this paper. We focus on their software aspects, and more specifically on the programming issues of
parallel computers. A tremendous amount of research and hardware development of such machines has
been done from the early days of computing. Many different architectures have been proposed and many of
them have culminated in prototypes or commercial systems. Up until the last few years, however, the
software aspects of parallel machines had been overlooked. This led to a point where very powerful
parallel machines can be built but cannot be used to their fullest potential. Software support for programming
these systems was minimal up until recently, and it is still inadequate.

Although this has become apparent and has attracted attention to the fundamental issues of parallel
programming, we are still far from understanding the general and global nature of the problem called
parallel programming. We have only begun to see parallel languages, let alone standards, portability, and
programming tools. Of course, parallel programming, as a relatively new field, has not matured yet. Many
crucial problems remain unsolved or only partially solved. It is true that in parallel programming one faces
more complex problems than in serial programming. Another issue that adds more complexity to parallel
programming is the variety of architectural models of parallel machines that have not yet been (or cannot
be) abstracted from the programming level. For example, our knowledge as to whether our target model is
a distributed or shared memory model has profound ramifications even at the algorithm design level (and
of course at the program development level). After all, it may not even be possible to reach the same
abstraction and portability levels for parallel programming as has been the case for serial programming.
In parallel programming the primary targets are efficiency and speed, which bring it closer to
the machine level. Compromising speed and efficiency for portability and productivity may not be a
desirable solution.

2. Organization

In this paper we review and discuss many aspects of parallel programming and summarize recent
approaches to some of the fundamental issues. Section 3 gives a summary of parallel computer
architectures, but the rest of the paper focuses on software aspects of parallel programming. Section 4 gives a
high-level discussion of parallel programming where the most important issues are identified and
discussed. The following sections focus on each particular aspect of parallel programming and present some of
the most recent approaches to solving each of them. More attention is placed on parallel languages,
parallelizing compilers, and scheduling.

In particular, Section 5 highlights the main concurrency features of a number of parallel programming
languages and systems. Section 6 discusses parallel programming through automatic program
restructuring, and gives examples of the application of a few representative transformations. Section 7 goes
into more detail with parallelism exploitation issues such as partitioning, synchronization, and scheduling.
Different approaches to the general scheduling problem are considered, and loop scheduling is examined in
more detail.

Section  8  gives  an  overview  of  our  approach  to  some  of  the  above  issues  in  the  context  of  the 
Parafrase-2  project.  A  novel  and  comprehensive  environment  for  automating  the  parallelism  exploitation 
phase is discussed. Section 9 considers multiprocessor operating system issues, and provides a
multiprocessor operating system taxonomy based on three important functionalities. Finally, some concluding
remarks  are  given  in  Section  10. 

3.  Parallel  Computer  Architectures 

In  order  to  identify  the  class  of  computer  architectures  which  are  the  subject  of  this  discussion,  we 
shall  briefly  review  and  classify  computer  architectures  which  support  some  sort  of  concurrent  or  parallel 
processing  in  the  context  mentioned  above.  A  typical  serial  computer  is  based  on  the  principle  of  control 
flow  where  instructions  are  executed  one  at  a  time  in  a  predefined  order;  each  program  is  viewed  as  a  single 
instruction stream. A program counter points to the current instruction to be executed. Instruction execution
can be accomplished in four stages: instruction decoding, fetching of operands, execution, and storing
of results. Since these activities can be functionally independent, a new instruction can be initiated as
soon as the current instruction exits the first phase. Thus four instructions can be active in their execution
cycle, each passing through a different phase. This technique, known as pipelining, was realised in early
computers and provided limited parallelism; asymptotically, four instructions could be executed in the
same time it would have taken one instruction to execute without pipelining. The idea of pipelining was
not  used  only  in  the  context  of  the  instruction  execution  cycle,  but  also  in  a  number  of  different  activities 
during  program  execution. 
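
The asymptotic claim can be checked with a small timing model. The sketch below is illustrative only: it assumes every stage takes exactly one cycle and that the pipeline never stalls, and the function names are ours rather than part of any system discussed here.

```python
def serial_cycles(n_instructions, n_stages=4):
    """Cycles needed when each instruction completes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=4):
    """Cycles needed when a new instruction enters the pipeline every cycle.

    The first instruction needs n_stages cycles; every later instruction
    completes one cycle after its predecessor (assumes no stalls).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages=4):
    """Ratio of serial to pipelined cycle counts."""
    return serial_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)
```

Under these assumptions a single instruction gains nothing, while the speedup for long instruction sequences approaches, but never reaches, the number of stages.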

A  straightforward  (and  oversimplified)  extension  to  the  single  program  counter  computer  is  to 
extract  several  independent  instruction  streams  from  a  single  program,  and  process  each  of  them 
separately  on  a  different  conventional  computer.  This  is  the  main  idea  upon  which  most  modern  parallel 
computers are based. A number of stand-alone processing elements are interconnected (for "occasional"
communication) but each processing element remains (in its control structure) a conventional computer.
A radically different model of computing machine is the dataflow model. The notion of a program
counter does not exist in the dataflow model; instead, an instruction is executed as soon as its operands are
available [ArNi87]. Although the dataflow principle is particularly attractive, its realisation is less attractive.
On a finite size dataflow machine we still have to face problems which are present in the control flow
model  [GPKK82].  Since  more  instructions  can  be  ready  to  fire  than  execution  units  available,  one  needs  to 
decide  on  some  order  of  execution.  This  order  will  have  a  profound  effect  on  when  other  instructions 
become  ready  to  execute.  One  also  needs  to  decide  on  which  unit  a  particular  instruction  should  execute. 
These  are  only  some  of  the  problems  which  make  the  realisation  of  a  dataflow  system  more  complex  and 
costly  than  an  equally  powerful  control  flow  system. 
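
The firing rule and the effect of a finite number of execution units can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions (unit-time instructions, an arbitrary alphabetical firing policy); the function and the example graph are hypothetical and do not model any particular dataflow machine.

```python
def dataflow_schedule(deps, units):
    """Simulate unit-time dataflow execution on a machine with `units` execution units.

    deps maps each instruction name to the set of instructions whose
    results it consumes. An instruction may fire once all of its operands
    have been produced. Returns the list of steps, each a list of the
    instructions fired in that step.
    """
    done = set()
    remaining = dict(deps)
    steps = []
    while remaining:
        ready = sorted(name for name, ops in remaining.items() if ops <= done)
        if not ready:
            raise ValueError("cyclic dependence: nothing can fire")
        fired = ready[:units]  # policy choice: alphabetical order (an assumption)
        for name in fired:
            del remaining[name]
        done.update(fired)
        steps.append(fired)
    return steps


# Hypothetical example: e needs c and d; c needs a; d needs b.
EXAMPLE = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"}, "e": {"c", "d"}}
```

With two units the example graph completes in three steps; with one unit it needs five, and the order in which ready instructions are chosen determines when the others become ready.
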

A number of dataflow computers have been built so far, the most recent being the Sigma-1
of ETL [HiSN84], [TYUY86]. Many of these borrowed principles from the conventional von Neumann
architecture but none became a competitive and cost-effective product. The idea of dataflow can be
applied more successfully at different levels of program execution if combined with the control flow model.
For the rest of this presentation we focus on the more traditional control flow model, on which
the great majority of existing parallel machines is based.

There  are  three  basic  approaches  to  parallel  processing.  In  the  first  scenario,  parallelism  is  exploited 
by executing many instances of the same instruction(s) on different sets of data. For example, for the multiplication
of two vectors of size n we can execute concurrently n multiplication instructions, each operating
on a different pair of vector elements. Thus in principle one would multiply two vectors in the same
amount of time required to multiply two scalars. This model of parallel computation is often called
Single-Instruction-Multiple-Data or SIMD and architectures that support this model are termed SIMD
architectures. The pipeline model is clearly SIMD. From the control structure point of view, SIMD
machines have a single control unit which is responsible for instruction issuing and execution. Common
realisations of SIMD computers are vector, pipelined, and array machines. Such computer systems include
the STARAN, the Illiac IV, the ICL-DAP, the Goodyear MPP, the Burroughs BSP, the CDC Cyber 205,
the Cray-1, the Fujitsu VP-100/200, Hitachi S-810, NEC SX-2, Convex-1, and the Connection Machine,
just  to  mention  a  few. 
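
The vector multiplication example can be written down as follows. The sketch only models the semantics of the SIMD operation (one multiply applied to every pair of elements); ordinary Python executes the products one after another, so it says nothing about timing. The function name is ours.

```python
def simd_multiply(x, y):
    """Apply the same multiply operation to every pair of corresponding elements.

    Each product depends only on its own pair of operands, which is why an
    SIMD machine could, in principle, compute all n products in one step.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same size")
    return [a * b for a, b in zip(x, y)]
```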

The  second  class  of  parallel  machines  are  those  that  can  process  independent  instruction  streams 
(each operating on its own data set) simultaneously and are therefore termed
Multiple-Instruction-Multiple-Data or MIMD systems. MIMD systems are usually composed of a number of independent
stand-alone processing elements. It is very rare, however, to extract multiple instruction streams which are
completely independent from real applications. Some control and data information exchange between
different streams is usually necessary. To realise this communication, processing elements in a MIMD system
need to be able to communicate. For applications where this information exchange is very infrequent,
the best processor interconnection scheme is probably a point-to-point connection. Under this scheme each
processor operates ordinarily out of its private memory. Processors which need to communicate assemble
packets of information called messages and forward them to the requesting processor. This mode of
communication is known as message-passing and computer systems supporting message-passing are usually
designed as distributed memory machines. For applications where this information exchange between
different processors is very frequent, a better way of communication is through a shared memory where
different processors can read and write data to a common memory location. Thus communication is
achieved through sharing of memory; a processor writes into a particular memory location and another
processor reads from the same location. Parallel machines based on this architecture are known as shared
memory multiprocessors or parallel computers. Shared memory is the predominant architecture in modern
parallel computers and is the underlying architecture model for most of the discussion that follows. Table
1 gives a summary (by no means exhaustive) of shared and distributed memory parallel computers
[GeAG88], [Hwan87]. Although some of the software aspects discussed in this paper are particularly suited
for  shared  memory  multiprocessors,  most  of  the  ideas  are  applicable  to  both  machine  architectures. 
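
The two communication styles can be contrasted in a few lines. The sketch below uses Python threads as stand-ins for processors, which is an assumption made purely for illustration; the function names are ours. In the first version workers interact only by sending messages, in the second by updating one common memory location under a lock.

```python
import queue
import threading


def sum_by_message_passing(chunks):
    """Each worker sums its private chunk and sends the result as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the message is the only interaction

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


def sum_by_shared_memory(chunks):
    """Each worker adds its partial sum into one shared memory location."""
    shared = {"total": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # serialise the read-modify-write on the shared location
            shared["total"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]
```

Both versions compute the same result; they differ in where the synchronisation cost is paid, which is the trade-off the text describes between infrequent and frequent information exchange.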

4.  Parallel  Programming  Issues 

In this section and for the remainder of this paper we will concentrate on the programming aspects of
parallel  computers  ranging  from  supercomputers  to  minisupercomputers.  As  mentioned  earlier,  parallel 
computers  have  evolved  impressively  in  the  last  few  years  from  the  architecture  and  the  hardware  point  of 
view.  Progress  on  the  software  side  can  be  best  characterised  as  moderate. 

There are many reasons for the slow spread of parallel programming. Parallel programming is still
an art in its infancy, and as such, it lacks standards and software tools for parallel program development.
Most programmers are accustomed to traditional serial programming. They have at their disposal a plethora
of  programming  languages,  editors,  compilers,  libraries,  debuggers  and  numerous  other  tools  that  make 
programming  in  any  preferred  style  an  easy  task.  More  importantly,  many  of  these  tools  and  environments 
are  standards  which  make  codes  portable  between  a  large  variety  of  sequential  computers.  When  it  comes 
to  parallel  programming,  none  of  the  above  holds  true. 

Table 2. Parallel programming issues.

SPECIFICATION OF PARALLELISM
  • Language constructs for expressing and packaging data/functional parallelism at all granularity levels.
  • Constructs for synchronisation and communication.
  • Means for expressing arbitrary nesting, repetitive structures and networks.
  • Task/process abstraction from architecture details.
  • Means for avoiding and/or resolving nondeterminacy.
  • Distributed data structures.

EXPLOITATION OF PARALLELISM
  • Load balancing and low run-time overhead simultaneously.
  • Fast synchronisation/communication.
  • Dynamic selection of granularity of parallel tasks within well-defined bounds.
  • Low overhead process creation and management.
  • Minimal OS involvement.
  • Intra/interprocedural dependence analysis and powerful parallelising compilers.
  • Hardware support for synchronisation, context switching, processor allocation.

SUPPORT ENVIRONMENTS & TOOLS
  • Tools for debugging and tracing nondeterminacy.
  • Program profiling and static performance analysis.
  • Graphical interactive user interfaces.
  • Optimisation and performance data bases.
  • Numerical stability analysers.
  • Expert systems.
  • Adaptive operating systems.

In  an  attempt  to  identify,  examine,  and  propose  alternative  solutions  to  the  crucial  issues  of  parallel 
programming,  we  start  by  dividing  the  aspects  of  parallel  programming  into  three  well-defined  phases. 
The first phase is that of Parallelism Specification, and includes issues involved in, and means for expressing
parallelism in algorithms and programs. The second phase groups together issues involved in the
Exploitation  of  Parallelism,  with  performance  being  the  major  underlying  factor.  Finally,  the  third  phase 
involves  Support  Environments  and  Tools  which  provide  the  means  for  achieving  the  goals  targeted  by  the 
first  two  phases,  in  a  "user-friendly"  and  globally  efficient  manner.  Table  2  summarises  the  goals  and 
issues  involved  in  each  case.  In  the  rest  of  this  section  we  address  each  of  the  three  phases  separately  and 
for  each,  we  examine  the  underlying  goals  and  the  major  realisation  and  performance  issues. 

SPECIFICATION  OF  PARALLELISM:  The  first  phase  in  the  development  of  a  "parallel"  program 
involves  the  selection  or  design  of  a  suitable  parallel  algorithm  for  a  particular  application  and  parallel 
architecture.  The  problem  of  parallelism  specification  starts  at  this  point.  An  ideal  parallel  language 
would provide syntactic constructs which make parallelism specification possible and easy at all granularity
levels. Put in other words, the language should provide the means for expressing maximal parallelism.
By  maximal,  we  mean  explicit  parallel  coding  of  all  parallel  operations  in  a  given  algorithm. 

There are two reasons for this: the complexity of parallelism qualification, and portability. Having the
programmer decide whether parallelism at the operation level is more appropriate than at the loop or subroutine
level is not desirable due to the enormous potential complexity of this task (assuming that we
refer to a case where parallelism is abundant at all levels). On the other hand, qualifying parallelism during
the writing of a program inevitably implies tuning the program for a particular machine. Thus portability
(with respect to performance) between different parallel architectures (e.g., a VLIW and a parallel
scalar  architecture)  is  difficult  to  achieve. 

A  parallel  language  which  supports  manual  programming  for  maximal  parallelism  alleviates  both  of 
these  problems  and  shifts  the  responsibility  of  parallelism  qualification  to  the  compiler.  There  is  little 
doubt  that  this  can  be  best  done  by  an  optimising  compiler.  In  fact,  maximally  parallel  programs  would 
make  dependence  analysis  a  much  easier  task  for  parallelising  compilers.  We  return  to  this  issue  later  in 
this  section  as  well  as  in  the  second  part  of  the  paper. 

Besides syntactic structures for coding parallel constructs, parallel languages must provide means for
packaging basic structures into hierarchical parallel structures, for synchronising accesses to shared data, and
for communicating data between different parallel tasks. Memory hierarchies in parallel machines
give rise to the need for language attributes for classifying data as private, task-shared, or global, and for
dynamic memory allocation of temporary structures.

Parallel languages cannot "mature" before the parallel programming community gains a deeper understanding of
the fundamental principles of parallel processing and builds more experience in programming parallel
machines. The present lack of flexible parallel languages can be partly overcome by interactive tools,
which compensate for the lack of expressiveness of parallel languages through program annotations, optimising
compilers, debuggers, performance and program profilers, etc. As these integrated interactive
environments  become  more  powerful  we  shift  to  fully  automated  parallelism  specification  methods;  even 
the  parallelism  specification  task  is  taken  away  from  the  programmer. 

As indicated in Table 2, parallelism specification is only one of the many and complex aspects of
parallel programming. This complexity leaves little doubt about the necessity of interactive parallel programming
environments, which encompass all of the above mentioned tools and more. The central tool in
such an environment would necessarily be a parallelising compiler. Such a compiler could automatically
parallelise programs written in a serial language, as well as verify, enhance, and qualify the parallelism of
already parallel programs. One would then be tempted to ask: why then the need for parallel languages?
This is a question without many obvious answers which has drawn considerable debate. Given the present
limitations of parallelising compilers, one could argue that manual parallel programming would produce
better  quality  parallel  code  than  such  a  compiler.  An  important  issue  for  parallel  languages  is  that  of 
parallel  algorithm  design.  Manual  parallel  programming  will  force  programmers  to  "think  in  parallel"  and 
will  expedite  research  on  parallel  algorithm  design. 

In summary, parallelism specification can be done manually by means of a parallel programming
language,  automatically  through  a  restructuring  (parallelising)  compiler,  or  through  a  combination  of  the 
above.  Section  5  discusses  parallelism  specification  from  the  programming  languages  perspective.  Having 
parallelism  specified  externally  (at  the  source  level)  or  internally  (at  the  intermediate  representation  level 
through  a  compiler),  the  next  step  is  the  exploitation  or  mapping  of  parallelism  onto  a  parallel  computer. 

EXPLOITATION OF PARALLELISM: The main aspects of parallelism exploitation include
qualification of parallelism, packaging or partitioning, and scheduling or resource (processor/memory) allocation.
These are only the "abstract" aspects of the parallelism exploitation phase. When parallelism is to
be fully exploited (i.e., minimise run-time overheads and balance loads across processors), then one needs
to address compiler, operating system, as well as hardware issues as discussed below. A partial set of practical
approaches to the parallelism exploitation problem includes [Beck89], [Bokh88], [CGMW88], [Coff76],
[Fox87],  [Gokh87],  [Gupt89],  [Jaya88],  [KaNa84],  [Mann84],  [PoKu87]. 

We assume that parallelism specification is done by means of a programming language which supports
maximal expression of parallelism, or by means of automatic program restructuring. Below we
attempt  to  justify  why  parallelism  exploitation  can  be  best  done  through  the  compiler  and/or  the  run-time 
system  (although  there  are  hardly  any  voices  against  this  approach). 

Qualification of parallelism (Compiler): Consider a maximally parallel program which is to be compiled for
a specific parallel machine. In general, parallelism in such a program can be present at several different levels
of granularity. For example, we may have parallel operations within single statements and vector statements,
several such statements composing independent basic blocks, independent basic blocks nested inside
parallel  loops,  independent  parallel  and/or  serial  loops,  independent  subroutine  calls,  and  other  higher 
level  objects  which  may  potentially  execute  in  parallel. 

Likewise, when it comes to real parallel computers, we face three (not necessarily orthogonal) questions:


1)  Since  parallel  machines  have  a  limited  number  of  computational  elements,  and 
assuming  we  have  "parallelism  explosion"  in  a  given  program,  one  needs  to  decide 
which  parallel  constructs  will  eventually  execute  in  parallel,  and  which  will  be 
ignored  (serialised). 

2)  Since few machines support parallelism in the hardware for all above cases, some
parallel constructs need to be serialised or restructured to other constructs supported
by a specific machine (put in other words, parallelism may need to be repackaged).
As an example consider a loop which has been coded as a parallel/vector
loop in a source language, and is to be executed on a vector computer supporting
multidimensional vector statements.

3)  Overhead: Exploiting parallelism in the hardware does not come for free. The
familiar pipeline set-up and/or start-up time is overhead paid for vector instructions.
Compensation code is the overhead paid for supporting VLIW instructions in
a viable manner [Nico84]. Process creation time, queue access and scheduling, and
synchronisation-communication time are overheads associated with MIMD parallelism
[Poly88]. Due to these overheads parallel execution of certain constructs may
result  in  execution  times  longer  than  the  corresponding  serial  ones. 

The  parallelism  qualification  phase  is  responsible  for  optimizing  the  code  with  respect  to  each  of  these 
scenarios. 
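
The overhead question can be made concrete with a deliberately crude cost model. The sketch assumes a fixed overhead per parallel construct and perfectly divisible work; both the model and the function names are illustrative assumptions, not the estimates a real compiler would use.

```python
def parallel_time(work, processors, overhead):
    """Estimated time when `work` is split evenly and a fixed overhead is paid once."""
    return overhead + work / processors


def qualify(work, processors, overhead):
    """Keep a construct parallel only if the estimate beats serial execution."""
    if parallel_time(work, processors, overhead) < work:
        return "parallel"
    return "serial"
```

Under this model a construct is worth running in parallel only when its work exceeds overhead × p / (p − 1); with four processors and an overhead of 30 time units the break-even point is 40 units of work, so smaller constructs would be serialised.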

Packaging or partitioning (Compiler): Partitioning refers to the process of merging or splitting well-defined
units of computation into larger or smaller ones respectively. In order to maintain consistency in our terms
we define a process to be a unit of computation which is scheduled as a whole and is executed under the
control of a single program counter (PC). Hence a process may be serial code, or SIMD code (vector, VLIW
instructions etc.). A task is composed of one or more processes but it is treated as a unit by the compiler.
In simple terms, partitioning decides whether sets of tasks may potentially execute in parallel. Merging
dependent tasks is appropriate if their parallel execution demands excessive synchronisation or communication
between those tasks during execution. Splitting a large task into smaller ones is a way of reducing
the granularity and hence increasing the exploitable parallelism. Partitioning is discussed in further detail in
Section  7.1. 
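
The merging half of partitioning can be sketched with a simple greedy rule. This is an illustration only: it assumes tasks arrive as a sequence of estimated costs and that a single threshold, here called min_grain, captures the point below which synchronisation or communication would dominate. Neither the threshold nor the function comes from the methods cited above.

```python
def merge_tasks(costs, min_grain):
    """Greedily merge adjacent tasks until each merged task costs at least min_grain.

    Returns a list of groups, each a list of original task indices.
    A trailing group that falls short is folded into the previous group.
    """
    groups, current, current_cost = [], [], 0
    for index, cost in enumerate(costs):
        current.append(index)
        current_cost += cost
        if current_cost >= min_grain:
            groups.append(current)
            current, current_cost = [], 0
    if current:  # leftover tasks that never reached the threshold
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups
```

For the costs 1, 1, 1, 5, 2, 2, 1 and a minimum grain of 4, the seven original tasks collapse into two larger ones, trading some potential parallelism for fewer scheduling and synchronisation events.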

Scheduling (Compiler/Run-time system/OS): Scheduling of the independent processes or tasks of a program
on different processors is an activity which may affect execution performance dramatically. Depending
on whether scheduling is performed statically (at compile/load-time) or dynamically (at run-time) it
may or may not be separated from the partitioning phase. Typically, there are two levels of scheduling in a
multiprocessor system: scheduling at the job level, where processor resources need to be allocated fairly
among several users to accomplish multiprogramming, and scheduling within a job in order to distribute
independent tasks to different processors to accomplish multiprocessing [Poly89]. Scheduling is the subject
of  Section  7.3. 
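
The difference between static and dynamic scheduling within a job can be seen on a loop whose iterations have unequal costs. The sketch below is a simulation under stated assumptions: iteration costs are known, dispatching an iteration costs nothing, and the static scheme assigns contiguous blocks. The function names are ours.

```python
import heapq


def static_schedule(costs, processors):
    """Pre-assign contiguous blocks of iterations; return the overall finish time."""
    n = len(costs)
    block = -(-n // processors)  # ceiling division
    loads = [sum(costs[p * block:(p + 1) * block]) for p in range(processors)]
    return max(loads)


def dynamic_schedule(costs, processors):
    """Self-scheduling: each processor takes the next iteration as soon as it is idle.

    Simulated with a min-heap holding the time at which each processor becomes free.
    """
    free_at = [0] * processors
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest idle processor
        heapq.heappush(free_at, start + cost)
    return max(free_at)
```

For eight iterations with costs 8, 1, 1, 1, 1, 1, 1, 1 on two processors, the static blocks finish at time 11 while self-scheduling finishes at time 8. A real run-time system would also pay a dispatch cost per iteration, which this model ignores.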

Hardware support: The overhead involved in parallel program execution can be reduced through several
software and hardware approaches. Hardware support for synchronisation, context switching, queueing,
interprocessor communication, and processor allocation is necessary for very high performance parallel
computers. As parallel processing overhead keeps decreasing, parallelism exploitation at lower granularity
levels becomes effective. Most of the modern parallel computers incorporate specialised hardware for supporting
these operations. For instance, the Alliant FX/8 uses a control bus for realising fast microtasking; a
set of synchronisation registers in the Cray X-MP and Y-MP is used in a similar fashion; a large register
file in the Convex C-240 is used for fast context switching and micro/macrotasking [CGMW88].
Hardware support for such special operations as barrier synchronisation may improve the performance of
parallel tasks significantly [Alli85], [Beck89], [Gupt89].
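
For readers unfamiliar with the term, the semantics of a barrier can be shown in software. The sketch uses Python's threading.Barrier, a software barrier chosen only for illustration; it does not represent the hardware mechanisms cited above, and run_phases is a hypothetical helper.

```python
import threading


def run_phases(n_workers):
    """Run two phases separated by a barrier; return the logged events in order."""
    barrier = threading.Barrier(n_workers)
    log = []
    lock = threading.Lock()

    def worker(ident):
        with lock:
            log.append(("phase1", ident))
        barrier.wait()  # nobody starts phase 2 until everyone has finished phase 1
        with lock:
            log.append(("phase2", ident))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Whatever the interleaving of the workers, every phase 1 entry in the log precedes every phase 2 entry. The cost of making all workers wait at this point is the overhead that hardware support aims to reduce.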

SUPPORT ENVIRONMENTS & TOOLS: The issues involved in the parallelism specification and
exploitation phases are many and often overwhelmingly complex for any programmer. On the other hand it
is  often  impossible  or  difficult  for  the  compiler  to  gather  all  the  information  necessary  to  carry  out  these 
optimisations.  Thus,  an  interaction  between  the  user  and  the  compiler  is  mandatory  for  achieving  the  best 
result  in  general. 

User  intervention  may  happen  at  several  different  levels.  A  parallel  programming  environment  must 
provide the appropriate user interface to allow for easy and effective interaction with the user (Figure 1).
Such an interface should provide the tools for representing a program at many different levels, starting
from the source level to the dependence graph, and up to the task or subroutine call graph level. User feedback
can be in the form of predicates which specify the relationship between atomic structures, or value-ranges
for variables, in the form of task qualifiers etc.

In addition, a programming environment must support parallel program debugging and tuning. Nondeterminism
and other race conditions in a multiprocessor make parallel program debugging a particularly
critical issue. Post-mortem analysis and performance profilers are also needed to allow further tuning for
enhancing performance. Numerical stability analysers can be useful in determining the effect of program
restructuring on the stability of computations [BrGa89]. Static timing analysis of a parallel program is
necessary to guide many phases of a parallelising compiler [Cytr84], [Poly88], [SBDN87]. For example, the
parallelism qualification phase is based on compile-time estimates of code and/or data structure size, execution
time estimates, etc.

Programming environments will allow a user to display segments of a program in different representations
ranging from the code itself to dependence and control flow graphs to task graphs. In addition, a
profiler can give information as to where a program spends most of its time, or a detailed timing profile for
an entire application. A library of different synchronisation schemes provides synchronisation alternatives
for different circumstances. The most appropriate type of synchronisation for each circumstance can then
be chosen by the user or automatically by the compiler. A number of other libraries, data bases, and other
program manipulation tools complete the picture of a parallel programming environment which lies on top
of a powerful parallelising compiler. It will not be long before such programming environments become
common.

Figure 1. A parallel programming environment.

Figure 1 shows the different phases of a complete programming environment. The same diagram can
be viewed as a parallel program development and execution cycle. Each of the boxes in the figure
corresponds to a number of tools that perform a specific function. For example, the "program restructuring"
box includes tools such as a parallelising compiler, static performance analyser, program data-base for
interprocedural analysis, code generation, and possibly profile history, graphical interfaces for displaying
program representations, etc. All these tools must be integrated and interfaced in a convenient manner to
provide for maximum user productivity and program performance, which is the end-goal. An expert system
may be built on top of such an environment to provide user support as well as to guide several program
optimisations such as multi-version code generation [WaGa89], [Wolf89]. It is worth noting that at
present,  there  exists  no  programming  environment  that  supports  automation  of  even  a  small  subset  of  the 
functions  discussed  above. 

5.  Parallel  Languages  and  Systems 

Few  programming  languages  provide  adequate  support  for  parallel  programming.  Parallel  language 
design is by and large at an experimental stage. One of the earliest and most influential formalisms
toward a parallel programming language was CSP, proposed by Hoare [Hoar78]. CSP is the foundation
upon  which  Occam  was  built  for  transputer  based  networks,  and  it  influenced  the  design  of  other 
languages  and  systems  as  well  (e.g.,  Linda,  [ShCG86]). 

At present, manual parallel programming is accomplished in three different ways: through the use of
"parallel" languages, through program annotations, or through a combination of language constructs, program
annotations, and tools that work between the compiler and the run-time or the operating system,
and provide virtual processors that can be user controlled. Early multiprocessors provided parallel programming
support via annotations in the form of comment cards or function calls. These annotations tap
into the run-time library and eventually the operating system to provide the means for creating multiple
tasks, allocating memory for them, and scheduling them on the computing elements of a multiprocessor.
The first version of the Cray multitasking library worked in a similar way [Cray85].

Following  similar  approaches  most  multiprocessor  vendors  provided  extended  run-time  systems  with 
parallel processing support, typically referred to as multi/macro/micro-tasking libraries. Run-time systems,
however, do not provide a portable environment for parallel programming; they are tailored to a
particular  machine.  Portability  was  the  major  force  behind  the  development  of  the  Schedule  package  at 
Argonne,  which  allows  the  user  to  explicitly  specify  data  dependences,  partitioning,  and  allocation  of  tasks 
to  virtual  processors  supported  by  that  environment  [DoSo87]. 

A combination of language extensions and run-time systems is being currently used by most super
and superminicomputer vendors, and many more have been developed at universities and other research
labs. An example of such a system is the Linda environment [AhCG88]. At different times Linda has been
suggested as a language, as a run-time support system, and even as an environment for parallel programming.
Based on CSP principles, Linda provides convenient (but not necessarily efficient) constructs for
sharing and communicating data between different processors. Like Linda, the great majority of these systems
provide a convenient and portable environment for parallel program development, but they are,
nevertheless, quite restrictive when it comes to types of parallel constructs supported, scheduling, synchronisation,
and above all efficiency and performance. Furthermore, none of these systems provides
an automatic solution to the crucial issues of parallel programming mentioned in the previous section. They
do, however, represent the state-of-the-art in parallel programming support environments.
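
The flavour of Linda-style data sharing can be conveyed by a toy tuple space. The sketch is an in-process illustration and not Linda itself: the operation names follow the commonly described Linda primitives out, in and rd (with in_ standing for in, which is a reserved word in Python), and the convention that None matches any field is an assumption of this example.

```python
import threading


class TupleSpace:
    """A toy in-process tuple space; None in a pattern matches any field."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        """Deposit a tuple into the space and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _find(self, pattern):
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, t)
            ):
                return t
        return None

    def rd(self, *pattern):
        """Block until a matching tuple exists; return it without removing it."""
        with self._cond:
            found = self._find(pattern)
            while found is None:
                self._cond.wait()
                found = self._find(pattern)
            return found

    def in_(self, *pattern):
        """Block until a matching tuple exists; remove it and return it."""
        with self._cond:
            found = self._find(pattern)
            while found is None:
                self._cond.wait()
                found = self._find(pattern)
            self._tuples.remove(found)
            return found
```

Producers and consumers never name each other; they communicate only through the contents of the space. The linear search in _find also hints at why such constructs are convenient but not necessarily efficient.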

In  the  rest  of  this  section,  we  shall  review  the  most  interesting  extensions  to  widely  used  languages 
for  supporting  parallelism.  Language  extensions  have  been  proposed  for  a  variety  of  existing  serial 
languages, both functional and imperative. User enthusiasm has varied and, as would be expected,
it depends on the popularity of the base serial language in the first place. Table 3 gives a summary of
existing parallel Lisp implementations for a variety of parallel architectures, including dataflow machines
[PaHK88].  Perhaps  the  most  well-known  Lisp  dialect  is  Multilisp  (based  on  Scheme)  which  supports 
parallelism  through  the  future  construct  [Hals88]. 

At  the  language  level,  futures  provide  the  means  for  decoupling  references  to  a  variable  from  the 
evaluation  of  that  variable  or  structure.  A  future  associated  with  a  variable  can  be  referenced  before  that 
variable  has  been  evaluated;  in  that  case  the  execution  of  the  object  making  the  reference  is  blocked  until 
the evaluation is complete. Thus futures provide a vehicle for synchronising accesses to atomic and compound
structures. Although futures provide a flexible and powerful abstraction, when it comes to performance,
the real issue lies in the implementation and efficiency of this abstraction.
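
The behaviour of a future can be illustrated with Python's standard concurrent.futures module. This is an analogy rather than Multilisp: in Python the future must be touched explicitly through result(), whereas the text describes references that block implicitly. The helper functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for an expensive evaluation."""
    return x * x


def sum_of_squares(values):
    """Start every evaluation first, then consume the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns at once with a future standing for the value
        futures = [pool.submit(square, v) for v in values]
        # result() blocks only if that particular value is not yet available
        return sum(f.result() for f in futures)
```

The references to the values (the list of futures) exist before the values themselves have been computed, which is the decoupling described above.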

Similar  language  extensions  have  been  proposed  for  Prolog,  with  Concurrent  Prolog  being  the  most 
well-known parallel dialect [Shap88]. Table 4 gives a summary of parallel Prolog systems currently in use
[PaHK88]. In the next few sections we look in more detail at parallel Fortran languages, Fortran being by
and  large  the  predominant  language  for  programming  high  performance  multiprocessor  systems.  Occam  is 
reviewed  next  as  the  representative  of  languages  which  were  developed  originally  as  parallel  programming 
languages. 

5.1.  Parallel  Fortran  Dialects 

It is no accident that most of the "parallel" languages in use today are extensions and enhancements
to Fortran. Almost all supercomputer vendors supply an enhanced (vector and/or parallel) Fortran version
with their machines. In this section we give an overview of the most important features of various Fortran
dialects.

Fortran 8X: The only notable Fortran 8X features for "parallel" programming are the array operations
and statements [Lawr75], [Paul82]. Nevertheless, Fortran 8X is evolving in the right direction as a more
flexible and general purpose language by including facilities for complex user-defined data types, recursion,
more control constructs, etc. [ANSI88]. The new standard is a superset of Fortran 77 even though several
old and redundant features have become candidates for elimination in the next standard. In addition to
five types of intrinsic literal constants and scalar variables, 8X provides facilities for user-defined data
types. General type declarations have the following format:

[access] TYPE type-name [(type-param-name-list)]
   type-specification
END TYPE [type-name]

where access can be PUBLIC or PRIVATE and type-specification is a sequence of intrinsic and/or other
type declarations. An example of a type declaration is shown below.

TYPE date
   CHARACTER (LEN=7) day
   CHARACTER (LEN=10) month
   INTEGER year
END TYPE date

The declaration and use of complex data objects in Fortran 8X resembles that of records in Pascal. The
language provides for array declarations of fixed and variable sizes. For instance the statement

REAL, ARRAY(-2:5, 10) :: A

declares A to be a two-dimensional array (rank two) of reals with eight rows and ten columns. Vector
constants can be specified as lists of elements enclosed in square brackets. The sequence [5,5,5,7 8,8] is
a vector constant of size seven and rank one.

In terms of parallel constructs there is not much more than array expressions and assignment
statements. Array expressions and assignments are allowed between conformable arrays of compatible types. If
A, B, C, and V are declared as follows,

REAL, ARRAY(10, 10) :: A, B, C
REAL, ARRAY(5) :: V

then C = A+B defines the elements of C to be the sum of the corresponding elements of A and B. The
correspondence is by position in the extent (dimension). Thus, the same array assignment statement can
be rewritten as C(1:10, 1:10) = A(1:10, 1:10) + B(1:10, 1:10). Similarly, A = A+5 increments all elements
of A by five. The statement A(1, 1:5) = V results in overwriting the first five elements of the first row of A
by the corresponding elements of vector V.
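
The element-by-element meaning of these array statements can be spelled out with explicit loops. The
following sketch is written in Python rather than Fortran 8X and is only illustrative: nested lists stand
in for the arrays, the sample values are invented, and Python subscripts start at 0 rather than 1.

```python
# Nested lists stand in for REAL, ARRAY(10,10); Python indices start at 0,
# so Fortran element (1,1) is [0][0] here.
n = 10
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[1.0] * n for _ in range(n)]
V = [10.0, 20.0, 30.0, 40.0, 50.0]

# C = A + B: elements are combined position by position.
C = [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

# A = A + 5: the scalar is applied to every element.
A = [[x + 5 for x in row] for row in A]

# A(1, 1:5) = V: overwrite the first five elements of the first row.
A[0][0:5] = V
```

Since no element of the result depends on any other element of the result, each of these statements
can be executed by vector hardware or by independent processors.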

Finally, modules in 8X along with the PUBLIC and PRIVATE attributes on type declarations allow
for packaging and restricted use of data objects at different scopes of a program. Although Fortran 8X has
not been frozen at present, it is fair to say that the language is evolving as a more general purpose
language by providing the programmer with facilities for data abstraction, more efficient memory
allocation, structured programming and more powerful control constructs. Nevertheless, it provides minimal
support for concurrent programming.

Cedar and Cray Fortran: Cedar Fortran was designed on the basis of the 8X, Alliant, and Cray Fortran
extensions, and includes a number of additional extensions for multiprocessing support on the Cedar
machine [GPHL88], [KDLS86]. Given the clustered organisation of the Cedar multiprocessor, and the
desire to provide more control to the user, Cedar Fortran supports two types of concurrent loops: CDOALL
for intra-cluster concurrency, and SDOALL for inter-cluster concurrency. Consider the following doubly
nested parallel loop.

GLOBAL A(10, 20), I
SDOALL I = 1, 10
   INTEGER J
   LOOP
   CDOALL J = 1, 20
      A(I,J) = I+J
   ENDCDOALL
ENDSDOALL

Each iteration of the outer loop is assigned to a different processor cluster of the machine. Within each
iteration of the SDOALL we have another concurrent loop which is executed within the cluster owning its
current I value. Since A is written by different clusters, it is allocated in global memory. In addition to
GLOBAL, Cedar Fortran provides the CLUSTER memory attribute, for structures allocatable to cluster
memory. In the above example, integer J is cluster-private. Notice that since vector instructions are
supported by the Cedar (Alliant) processors, the machine supports up to three levels of parallelism.
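
The two-level structure of this loop nest can be imitated with nested thread pools. The sketch below is
an analogy in Python, not Cedar Fortran: the outer pool plays the role of the clusters and each inner
pool the processors of one cluster. The pool sizes and the names machine, cluster, outer_iteration,
and inner_iteration are assumptions made for illustration, and Python threads do not model the
memory hierarchy of the machine.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS, COLS = 10, 20
# Shared ("global") array written by all workers; index 0 is unused padding
# so that subscripts match the 1-based Fortran loop.
A = [[0] * (COLS + 1) for _ in range(ROWS + 1)]

def inner_iteration(i, j):
    # Body of the CDOALL: each (i, j) pair touches a distinct element.
    A[i][j] = i + j

def outer_iteration(i):
    # One SDOALL iteration: a "cluster" runs its own pool for the inner loop.
    # The inner index j exists only inside this call (cluster-private).
    with ThreadPoolExecutor(max_workers=4) as cluster:
        list(cluster.map(lambda j: inner_iteration(i, j), range(1, COLS + 1)))

with ThreadPoolExecutor(max_workers=ROWS) as machine:
    list(machine.map(outer_iteration, range(1, ROWS + 1)))
```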

Cedar locks and events are identical to those of Cray Fortran. The same is true for macrotasking
primitives. A new task can be spawned by executing the statement
int = ctskstart(num_proc, sub [, arg] ...), where sub is the name of the subroutine containing the task,
with an optional list of arguments, and num_proc is the number of processors requested for that task.
logical = ctskdone(int) returns true or false depending on whether task int has completed or not.
A join or barrier synchronisation can be accomplished through call ctskwait(int), which suspends
execution of the calling routine until task int completes execution.

Both the Cray and Cedar dialects include extensive support for macro and microtasking,
synchronisation, and dynamic memory allocation. Most of the Cedar Fortran facilities closely resemble those of
Cray Fortran. Both languages support multitasking through the run-time system as opposed to language
extensions. Since the multitasking mechanisms are essentially the same as those of the more recent IBM
Fortran, and since the latter provides true language extensions for tasking, we have chosen not to discuss
tasking in the Cedar and Cray Fortran, in favor of a more detailed review of the IBM Fortran.

IBM Fortran: Recently, IBM released the first version of its Parallel Fortran for the MVS/XA and the
VM/XA SP operating systems. The IBM Parallel Fortran is probably the most "loaded" Fortran dialect
for multiprocessing so far [IBM88]. It provides both direct language extensions for parallel processing as
well as a set of useful library routines and compiler directives. The combination of powerful language
extensions along with (limited) automatic parallelisation in the compiler gives the IBM Fortran a distinct
advantage over many other Fortran dialects. For parallel task execution the IBM run-time library follows
closely Cray's macro and microtasking, but more facilities for tasking are provided at the language level in
this case.

In the IBM Fortran, a task is a complete environment with its own local space and code. Tasks can
also share space with other tasks. Tasks can be explicitly created and manipulated by the programmer.
Tasking in the IBM Fortran is implemented in two stages. In the first stage, tasks are explicitly created
but they are not executable. We often refer to originated (but not allocated) tasks as virtual tasks or simply
tasks. In the next phase these virtual tasks are allocated work (parts of user code), again explicitly
through facilities provided by the language. We call this the binding phase and we refer to work allocated
to virtual tasks as user or real tasks, or simply tasks whenever the context is clear.

In addition to explicit tasking, implicit tasks created automatically by the run-time library are used
to execute explicitly coded parallel loops. In the rest of this section we concentrate on language rather
than library and compiler features. As mentioned above, the user creates a number of tasks at the
beginning of the program and performs the binding of real to virtual tasks. In addition to binding,
synchronisation, load balancing, and in general the entire tasking process are under the direct control of the
programmer. But let us review the most interesting features of the language.

First, we consider parallel loops. Vector processing was supported by the previous IBM Fortran
version for the 3090 series, and the syntax of vector statements is identical to that of Fortran 8X. The new
Fortran extensions include a parallel loop construct whose syntax is as follows:

PARALLEL LOOP label [,] index = l, u [, str]
   [PRIVATE (var [, var] ...)]
   [DOFIRST [LOCK]
      dofirst-block]
   [DOEVERY
      doevery-block]
   [DOFINAL [LOCK]
      dofinal-block]
label CONTINUE

The PARALLEL LOOP header initiates a loop which is to be executed by many implicit tasks. Depending
on the loop type and size the implicit tasks may be as many as the number of iterations of the loop (thus
different iterations may be executing simultaneously and independently), or many iterations may be
assigned to a single implicit task. Variables which are private with respect to each loop iteration may be
declared as such within the loop, following the PRIVATE clause.

The body of a parallel loop consists of a prologue which starts after the DOFIRST clause
(dofirst-block), the main loop body which starts after the DOEVERY statement (doevery-block), and an epilogue
(dofinal-block) which starts after the DOFINAL statement. The DOFIRST and DOFINAL blocks are
executed once by each implicit task participating in the execution of the loop. The DOEVERY block contains
the code which is to be executed by each iteration of the loop. The DOFIRST and DOFINAL blocks can
be executed exclusively (by the implicit tasks assigned to the loop) if the LOCK attribute is specified. Each
task obtains the lock before entering the corresponding block, and releases the lock upon exit from that
block. An example of a parallel loop performing a reduction follows.

      sum = 0
      PARALLEL LOOP 10 i = 1, N
         PRIVATE (s)
         DOFIRST
            s = 0
         DOEVERY
            DO 20 j = 1, M
               s = s + a(i,j)
   20       CONTINUE
         DOFINAL LOCK
            sum = sum + s
   10 CONTINUE
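
The same prologue, body, and locked epilogue pattern can be sketched with ordinary threads. The
fragment below is a Python analogy, not IBM Parallel Fortran: the number of implicit tasks, the
cyclic assignment of iterations to tasks, the sample data, and the names implicit_task and result
are assumptions made for illustration.

```python
import threading

N, M = 8, 5
a = [[i * M + j for j in range(M)] for i in range(N)]  # sample data
result = {"sum": 0}            # shared accumulator
lock = threading.Lock()

def implicit_task(rows):
    """One implicit task executing a share of the loop iterations."""
    s = 0                      # DOFIRST: initialise the private accumulator
    for i in rows:             # DOEVERY: runs once per assigned iteration
        for j in range(M):
            s += a[i][j]
    with lock:                 # DOFINAL LOCK: one task at a time updates sum
        result["sum"] += s

num_tasks = 3
chunks = [range(t, N, num_tasks) for t in range(num_tasks)]
threads = [threading.Thread(target=implicit_task, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each task touches the shared variable only once, in the epilogue, so the lock is acquired as many
times as there are tasks rather than once per iteration.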

Another important feature of the IBM Fortran is the PARALLEL CASES statement, which can be used in
much the same way as cobegin-coend to achieve parallel execution of independent computations. The syntax
of the PARALLEL CASES statement is shown below.

PARALLEL CASES
   [PRIVATE (var [, var] ...)]
   {CASE [m [, WAITING FOR {CASE n | CASES (n1 [, n2] ...)}]]
      case-block} ...
END CASES

All different case-blocks are evaluated simultaneously by different implicit tasks. If the WAITING FOR
attribute is specified for a CASE, that case-block will not execute before all cases referenced in that
WAITING FOR have completed execution. Therefore, one may use the cases statement to schedule acyclic
task graphs for parallel execution. Consider for example a task graph in which T2 and T3 depend on
T1, T4 depends on T2 and T3, and T5 depends on T3 and T4.


This graph may be coded in Fortran using the PARALLEL CASES statement as follows:

PARALLEL CASES
   CASE 1
      T1;
   CASE 2, WAITING FOR CASE 1
      T2;
   CASE 3, WAITING FOR CASE 1
      T3;
   CASE 4, WAITING FOR CASES (2,3)
      T4;
   CASE 5, WAITING FOR CASES (3,4)
      T5;
END CASES
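
The scheduling effect of the WAITING FOR clauses can be reproduced with futures. The sketch below is
written in Python as an analogy: the table waiting_for copies the dependences from the listing above,
while run_case, finished, and the use of one worker thread per case are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Predecessors of each case, as in the WAITING FOR clauses.
waiting_for = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [3, 4]}

finished = []                 # completion order, for inspection
finished_lock = threading.Lock()

def run_case(n, futures):
    for p in waiting_for[n]:  # block until every predecessor has completed
        futures[p].result()
    with finished_lock:       # placeholder for the real work of task Tn
        finished.append(n)
    return n

futures = {}
with ThreadPoolExecutor(max_workers=len(waiting_for)) as pool:
    # Submitting in case order guarantees the predecessors' futures exist.
    for n in sorted(waiting_for):
        futures[n] = pool.submit(run_case, n, futures)
```

Cases 2 and 3 may complete in either order, but every case appears in finished after all the cases it
waits for.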

Explicit task definition and manipulation is supported by direct language extensions. Any number of tasks
may be started anywhere in the program by using the ORIGINATE statement,

ORIGINATE ANY TASK tskid

where tskid is the identifier of the task initiated by the statement. A task created by the ORIGINATE
statement is owned by the task from which the statement was executed. The following loop,

      DO 10 i = 1, 10
         ORIGINATE ANY TASK tskid(i)
   10 CONTINUE

initiates 10 tasks which can be referenced inside the same scope by their identifier tskid(i). Upon initiation,
tasks are not assigned specific work unless an explicit SCHEDULE statement is executed. The SCHEDULE
statement binds a real task (user-defined) to an already originated task.

There are two versions of the SCHEDULE command. The first is the SCHEDULE TASK command
with the following syntax:

SCHEDULE TASK tskid,
   [TAGGING (tag1 [, tag2] ...),]
   [SHARING (shrcom [, shrcom] ...),]
   [COPYING (cpcom [, cpcom] ...),]
   [COPYINGI (cpicom [, cpicom] ...),]
   [COPYINGO (cpocom [, cpocom] ...),]
   CALLING subz[([arg1 [, arg2] ...])]

The second is the SCHEDULE ANY TASK whose syntax (besides the clause ANY) is identical to that of the
SCHEDULE TASK. Let us use the abbreviations ST and SAT for the two statements. In ST, tskid
specifies the identifier of a currently unused task which will execute the scheduled subroutine (or user task).
In contrast, SAT returns this identifier in tskid.

Each argument of the TAGGING attribute is a scalar value which is used to tag the piece of work
which is being scheduled. These tags can be used to identify tasks and determine their status (e.g.,
completed or executing). The arguments in the SHARING clause are the names of COMMON blocks which are
shared with the scheduled task. The COPYING part of the command takes as argument the name of a
COMMON block in the environment of the scheduling task. The contents of that block are copied into a
COMMON block of the same name which is created in the scheduled task's environment. Upon
completion of the scheduled task, the contents of the latter block are copied back into the former. COPYINGI is
as above but no copying back is performed upon completion of the task. Each argument of COPYINGO
is the name of a COMMON block in the environment of the scheduled task. The contents of this block are
copied back to a COMMON block of the same name in the environment of the scheduling task after the
completion of the scheduled task.
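
The difference between the four data clauses is easiest to see in a small model. The sketch below is
in Python and is not the IBM implementation: COMMON blocks are modelled as dictionaries, the
scheduled task is run synchronously instead of concurrently, a COPYINGO block is assumed to start
out empty, and the function schedule and the block names INP, WORK, and OUT are hypothetical.

```python
import copy

def schedule(task, parent_commons, copying=(), copyingi=(), copyingo=(), sharing=()):
    """Run task against a private environment built according to the clauses."""
    env = {}
    for name in sharing:                  # SHARING: the very same block
        env[name] = parent_commons[name]
    for name in tuple(copying) + tuple(copyingi):
        env[name] = copy.deepcopy(parent_commons[name])   # copy in
    for name in copyingo:                 # COPYINGO: nothing is copied in
        env[name] = {}
    task(env)                             # the scheduled subroutine runs here
    for name in tuple(copying) + tuple(copyingo):
        parent_commons[name] = env[name]  # copy back upon completion
    return env

commons = {"INP": {"x": 1}, "WORK": {"y": 2}, "OUT": {}}

def task(env):
    env["INP"]["x"] += 10      # change is discarded (COPYINGI)
    env["WORK"]["y"] += 10     # change is copied back (COPYING)
    env["OUT"]["z"] = 99       # result is copied back (COPYINGO)

schedule(task, commons, copying=["WORK"], copyingi=["INP"], copyingo=["OUT"])
```

After the call, INP is unchanged in the scheduling environment, WORK carries the update, and OUT
holds the value produced by the task.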

Finally, the CALLING part of the statement specifies the name of the subroutine to be scheduled for
execution (subz). Hence, unlike implicit tasks in the CASE statement, tasks which are explicitly scheduled
with the SCHEDULE command must be organised as subroutines. Another interesting feature of the
SCHEDULE command is that it allows the programmer to perform manual binding of program tasks to
virtual tasks (through the ST version of the command), but it also allows more dynamic binding through the
SAT version. In both cases, however, virtual tasks must be explicitly originated before the binding.

Synchronisation  of  SCHEDULEd  tasks  can  be  achieved  through  the  use  of  the  WAIT  FOR  statement. 
Every  SCHEDULE  statement  in  a  program  must  have  a  corresponding  WAIT  FOR  statement.  The  latter 
forces  the  issuing  task  to  wait  until  the  corresponding  scheduled  task  completes.  There  are  three  versions 
of  the  WAIT  FOR  statement.  The  first, 

WAIT FOR TASK tskid [, TAGGING (tag1 [, tag2] ...)]

blocks execution of the issuing task until the task with identifier tskid completes. The second version of the
command, WAIT FOR ANY TASK, with syntax identical to the first, blocks execution until any task
scheduled by the issuing task completes execution. Hence, if only four tasks with identifiers TSKID(1:4)
are SCHEDULEd from a given subprogram, then the sequence

WAIT FOR TASK TSKID(1)
WAIT FOR TASK TSKID(2)
WAIT FOR TASK TSKID(3)
WAIT FOR TASK TSKID(4)

will block further execution until all four tasks complete, and so does the sequence

WAIT FOR ANY TASK tskid
WAIT FOR ANY TASK tskid
WAIT FOR ANY TASK tskid
WAIT FOR ANY TASK tskid

The third version is the WAIT FOR ALL TASKS statement. If we replace the SCHEDULE keyword in the
SCHEDULE statements above with DISPATCH, we get another statement with identical syntax and similar
semantics. The only difference is that a task which has been assigned work through a DISPATCH can be
reassigned other work by another DISPATCH without an intervening WAIT FOR statement. However,
that task cannot be reassigned work through a SCHEDULE statement unless a WAIT FOR ALL
TASKS has been issued. The latter is the only means of synchronising dispatched work. Continuing the
previous example, a single WAIT FOR ALL TASKS statement would also be equivalent to the two WAIT
FOR sequences above.
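
The three forms of waiting have close counterparts in most task libraries. The following sketch uses
Python's standard concurrent.futures module as an analogy; the function work and its sleep times
are invented for illustration, and nothing here reproduces the TAGGING clause or the DISPATCH
statement.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED
import time

def work(k):
    time.sleep(0.01 * k)
    return k

with ThreadPoolExecutor(max_workers=4) as pool:
    tskid = [pool.submit(work, k) for k in (1, 2, 3, 4)]
    # WAIT FOR TASK tskid(i): wait for each named task in turn.
    by_name = [t.result() for t in tskid]

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {pool.submit(work, k) for k in (1, 2, 3, 4)}
    # WAIT FOR ANY TASK, repeated until nothing is outstanding: each wait
    # returns as soon as at least one outstanding task has completed.
    any_order = []
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        any_order.extend(f.result() for f in done)

with ThreadPoolExecutor(max_workers=4) as pool:
    tasks = [pool.submit(work, k) for k in (1, 2, 3, 4)]
    # WAIT FOR ALL TASKS: a single statement covers every outstanding task.
    all_done, not_done = wait(tasks, return_when=ALL_COMPLETED)
```

All three variants end with the four tasks complete; they differ only in the order in which the waiting
task learns of each completion.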

In addition to these language extensions, the IBM Parallel Fortran also provides a rich repertoire of
intrinsics and library routines for parallel locks and events as well as a set of compiler directives which
facilitate manual and semi-automatic parallel programming. With the exception of parallel loop handling,
the IBM Fortran provides the most complete extensions for tasking compared to other parallel Fortran
dialects. A summary of other vector and parallel Fortran dialects appears in Table 5 [GPHL88].

Occam: As a representative of a language originally designed as parallel we selected Occam, a language
which has been specifically designed (based on CSP) for distributed memory computers. The fact that
Occam is one of the first attempts to design a genuine message-based language makes it quite interesting.
Despite the almost enthusiastic acceptance of Occam by the transputer community, the language suffers
from many drawbacks. Before we comment on the advantages and the peculiarities of Occam let us
examine the basic facilities.

Unlike the languages that have been examined so far, the notion of global or shared variables and
structures in Occam does not exist. Rather, data exchange between different parts of a program occurs
through explicitly defined channels and is synchronous: a transfer takes place only when both the sending
and the receiving process are ready. Program modules are organised in
processes which communicate through user-defined channels. Communication is achieved with messages
which are assembled by a transmitting process and are forwarded through channels to a receiving process.
Thus, variables are always local to each process.

Occam programs can be written using three primitive processes and a number of constructs which
provide the means for grouping primitives into more complex program units. The three primitives are
assignment, input, and output statements. An assignment statement has the general form v := expr
and assigns the value of expression expr to variable v. I/O is always accomplished in Occam through
channels. Channels can be declared in a program just like any other variable using the attribute CHAN. As
discussed later, (virtual or program) channels can be mapped directly to hardwired channels of a particular
transputer configuration. If chan1 and chan2 are variables of type CHAN, then the primitive process
chan1 ! expr outputs the value of expression expr to channel chan1. An input process of the form
chan2 ? v inputs a value from channel chan2 into variable v.
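
The input and output primitives can be mimicked with a small channel class. The sketch below is in
Python and is only an approximation of Occam channels: it assumes that an output blocks until a
matching input has taken the value, it supports a single sender and a single receiver, and the class
Channel with its send and receive methods is hypothetical.

```python
import queue
import threading

class Channel:
    """Unbuffered channel: send returns only after the value is received."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):      # plays the role of  chan ! expr
        self._q.put(value)
        self._q.join()          # wait until the receiver has taken the value

    def receive(self):          # plays the role of  chan ? v
        value = self._q.get()
        self._q.task_done()
        return value

chan1 = Channel()
received = []

def producer():
    chan1.send(3 * 4)           # chan1 ! 3*4

def consumer():
    v = chan1.receive()         # chan1 ? v
    received.append(v)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Neither thread can run ahead of the other across the communication, which is why a process whose
partner never arrives remains blocked.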

Complex nested processes can be formed by using a number of constructs provided by the language.
Any well defined program unit (starting from the primitives) is a process with its own instruction stream
and data. It is convenient to represent Occam programs using directed graphs with nodes representing
processes and arcs corresponding to program defined channels. Since exchange of data and other
communication can occur only through channels, two or more Occam processes may communicate only if they are
simultaneously active (executing). If a process tries to communicate with another process whose execution
has not started or which is blocked, then the first process also becomes blocked. Thus, it is not hard to
specify a set of processes which can lead to a deadlock. Another peculiar feature of Occam is that primitive
processes must be written in a single line. For example, breaking a lengthy assignment process into two
lines may produce incorrect results. Some ad hoc rules can be used to avoid such cases but these rules are
not enforced by most Occam implementations.

In Occam indentation defines the scope of objects and constructs. Variables are local to the process
immediately following their declaration and at the same indentation level. The scope of constructs includes
all processes which are indented two spaces to the right of the construct specifier. Unlike most other
languages, lexical ordering does not relate to flow of control order. Occam provides a number of flow of
control constructs which are discussed below. The construct SEQ can be used to specify sequential
execution order for all processes following at the same indentation level. For example,

SEQ
  Q
  S
  W

amounts to executing process Q before S which is executed before W. The opposite effect is obtained by
using the construct PAR in place of SEQ. All processes within the scope of a PAR and at the same
indentation level can execute in parallel. Two other control constructs are the IF and ALT processes. An
IF process comprises a number of processes each preceded by a boolean expression. In this case evaluation
of the boolean expressions is performed in lexicographic order; the process following the first boolean
expression to evaluate to true is the one which is executed. The ALT construct can be thought of as an IF
process whose branches are evaluated simultaneously, and the first branch to satisfy the test condition is
the one taken. However, in ALT each component process is preceded by an input process, or by a boolean
expression followed by an input process. The first input process that completes successfully is the one
whose branch is taken. These input processes act like guard inputs. Consider the following process.

CHAN chan1, chan2:
INT v:
ALT
  chan1 ? v
    IF
      v > 0
        v := 0
      TRUE
        SKIP
  chan2 ? v
    v := 0

Variables chan1 and chan2 are declared to be of type channel and v of type integer (all three are local
to the process following at the same indentation level, i.e., ALT). The input guards in the above
fragment input a value from chan1 or chan2 into variable v. If the value comes from chan1 then the
process following that guard is executed. In that case, if v is positive it becomes 0; otherwise it remains
unaltered. SKIP is a special empty process. If input is performed through channel chan2 then v also
becomes 0. If neither of the input guards completes then the process is deadlocked. If both guards complete
simultaneously one path is chosen arbitrarily. To avoid this type of nondeterminism a prioritized
alternative process can be specified as PRI ALT. In that case priority is given to the process which is lexically
first.
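
The prioritized form is simple to model because it fixes the order in which the guards are examined.
The sketch below is a Python analogy of PRI ALT for the fragment above: channels are modelled as
buffered queues, the guards are polled in lexical order, and the names pri_alt, from_chan1, and
from_chan2 are hypothetical. Each branch returns the name of the channel together with the
resulting value of v so that the chosen path is visible.

```python
import queue

def pri_alt(guards):
    """Poll the guards in lexical order and take the first ready one.

    guards is a list of (channel, branch) pairs, where branch is a function
    applied to the value received on that channel.
    """
    while True:
        for chan, branch in guards:
            try:
                v = chan.get_nowait()
            except queue.Empty:
                continue
            return branch(v)

def from_chan1(v):
    return ("chan1", 0 if v > 0 else v)   # IF v > 0: v := 0; TRUE: SKIP

def from_chan2(v):
    return ("chan2", 0)                   # v := 0

chan1, chan2 = queue.Queue(), queue.Queue()
guards = [(chan1, from_chan1), (chan2, from_chan2)]

chan1.put(-3)
first = pri_alt(guards)        # only chan1 is ready

chan1.put(8)
chan2.put(5)
second = pri_alt(guards)       # both ready: the lexically first guard wins
third = pri_alt(guards)        # the input on chan2 is still waiting
```

If no input ever arrives, the loop in pri_alt never terminates, which corresponds to the deadlock
mentioned in the text.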

Occam allows replicated constructs to be specified using the above processes. The general syntax of a
replicated process is

<header> index = base FOR count
  process-body

where <header> can be any of the SEQ, PAR, IF, or ALT constructs, and process-body is a sequence of
processes for SEQ and PAR, a sequence of processes each preceded by a boolean for IF, or a sequence of
processes each preceded by an input guard or by a boolean and an input guard for ALT. The effect of a
replicator is to spawn a number of <header> processes indicated by count. Each process can be
referenced by its index value. Consider the following example.

[10][10] INT a, b, c :
INT i :
PAR i = 1 FOR 10
  INT j :
  SEQ j = 1 FOR 10
    INT s, k :
    s := 0
    SEQ k = 1 FOR 10
      s := s + a[i][k] * b[k][j]
    c[i][j] := s

The  above  Occam  program  performs  the  multiplication  of  two  matrices,  a  and  b  and  stores  the  result  in 
c.  The  replicated  PAR  spawns  10  independent  processes,  each  computing  a  vector  times  matrix  product. 
Thus  the  replicated  PAR  construct  serves  as  a  parallel  loop  in  this  case.  Similarly,  the  two  inner  replicated 
SEQs  function  as  serial  loops  (even  though  the  middle  SEQ  could  be  replaced  by  a  replicated  PAR). 
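
For comparison, the same decomposition can be written with a thread pool. The sketch below is in
Python and is an analogy rather than a translation: subscripts run from 0 instead of 1, the sample
matrices are invented (b is the identity so that the result is easy to check), and row_process is a
hypothetical name. Like the Occam program, it stores results by plain assignment into a shared c.

```python
from concurrent.futures import ThreadPoolExecutor

n = 10
a = [[i + k for k in range(n)] for i in range(n)]
b = [[1 if k == j else 0 for j in range(n)] for k in range(n)]  # identity
c = [[0] * n for _ in range(n)]

def row_process(i):
    # One process of the replicated PAR: row i of a times the matrix b.
    for j in range(n):              # middle replicated SEQ
        s = 0
        for k in range(n):          # inner replicated SEQ
            s += a[i][k] * b[k][j]
        c[i][j] = s                 # each process writes only its own row

with ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(row_process, range(n)))
```

The rows of c are disjoint, so the ten workers never write the same element.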

This example brings up many "loose ends" of Occam. The semantics of the language is not clear in
many cases and it is probably left up to each implementation to decide the exact semantics of the
language. First, let us consider the replicated PAR. According to the definition, parallel Occam processes
can execute simultaneously on different processors, say transputer nodes. On the other hand, processes
executing in parallel can communicate only through channels defined by the programmer. If in the above
example we assume that all three matrices are allocated to the local memory of one processor, then each of
the ten processes should send the locally computed row of c to the appropriate processor through a
user-defined channel. In the example above this does not happen. By using an assignment statement to store
the computed elements of c, we imply that the ten parallel processes created by the replicated PAR are to
execute on the same node, and thus are effectively serialised. On the other hand, a smart compiler could
detect this case and translate the assignment into a distributed assignment statement which completes
from two ends of a common channel. This could be done for example as

channel ! s
channel ? c[i][j]

where channel is one of ten user-defined channels. Thus, even the low-level details of process formation
and processor allocation are left to the discretion of the user. This puts an unreasonable demand on the
average user and it is in complete disharmony with one of the main goals of parallel programming, i.e., to
hide low-level details that demand a high level of expertise from the user. To the best of our knowledge
this problem is not eased by existing Occam compilers; more needs to be done for developing optimising
Occam compilers and restructurers.

6. Parallelising Compilers

Automatic program restructuring and optimisation for SIMD parallelism is an extensively researched
subject which was pioneered by Kuck and his colleagues at the University of Illinois, Kennedy and his
associates at Rice University, and Allen and her colleagues at IBM [ABCC87], [AlKe82], [AlKe87], [Bane88],
[KKLW80], [KKPL81], [PaWo88]. Many other researchers at several institutions have made very
significant contributions on the subject as well. Many of these ideas are directly applicable to
restructuring for MIMD parallelism although this latter subject is far from being well-established. Restructuring for
SIMD parallelism is more commonly referred to as vectorization. Restructuring for MIMD parallelism is
known as parallelization or concurrentization, with the former being the term of our choice.

Parallelising compilers not only transform serial programs, but can also further parallelise programs
written in a parallel language. Moreover, such compilers can be used to partially validate the
user-specified parallelism in an already parallel program. This section addresses the most important issues
involved in automatic parallelisation, and discusses viable solutions. Since it is not possible to give a
thorough introduction to this subject in this paper, we have chosen to discuss representative
transformations as examples [ChCi87], [Cytr84], [GiPo88], [Nico84], [Kuck78], [PaWo86], [Wolf82]. A more complete
discussion on program transformations can be found in [AlKe87], [PaWo86], [Poly88], [Wolf82].

6.1. Computing Data Dependences

Automatic parallelisation is based on data and control flow information that the compiler gathers by
analysing the program. Obviously, the more accurate the information gathered by the compiler, the more
aggressive the parallelisation. During dependence analysis, the compiler examines the flow of data through
a program to determine execution orderings which must be obeyed in order not to violate the original
semantics of the program. Before we make a brief introduction to data dependences we need to establish the
necessary notation.

A program is a list of $k \in \mathbb{N}$ statements. $S_i$ denotes the $i$-th statement in a program (counting in
lexicographic order). $I_j$ denotes a DO loop index, and $i_j$ a particular value of $I_j$. $N_j$ is the upper bound of a
loop index $I_j$, and all loops are assumed to be normalised, i.e., the values of an index $I_j$ range between 1
and $N_j$. We have two types of statements, scalar and indexed statements. An indexed statement is one that
appears inside a loop, or whose execution is explicitly or implicitly controlled by an index (e.g., vector
statements). All other statements are scalar. The degree of a program statement is the number of distinct
loops surrounding it, or the number of dimensions of its operands. $S_i(I_1, \ldots, I_k)$ denotes a statement of
degree $k$, where $I_j$ is the index of the $j$-th loop or dimension. An indexed statement $S_i(I_1, \ldots, I_k)$ has
$\prod_{j=1}^{k} N_j$ different instances, one for each value of each of $I_j$ $(j = 1, \ldots, k)$. $S_i$ will be used in place of
$S_i(I_1, \ldots, I_k)$ whenever the set of indices is obvious. We say that statement $S_i$ precedes $S_j$ in the order of
execution, and denote it by $S_i < S_j$, if under the serial control flow $S_i$ is executed before $S_j$.

Two statements $S_i$ and $S_j$ are said to be involved in a flow dependence $S_i \,\delta\, S_j$ if and only if $S_i$
precedes $S_j$ in the serial execution order, and a variable written by $S_i$ is read by $S_j$. An antidependence
between $S_i$ and $S_j$ is defined as the flow dependence above, except that in this case a variable read by $S_i$ is
written by $S_j$; an antidependence is denoted by $S_i \,\bar{\delta}\, S_j$. An output dependence is again defined as above
but with $S_i$ and $S_j$ writing to the same variable, and is denoted by $S_i \,\delta^o\, S_j$. In all three cases $S_i$ is called
the dependence source and $S_j$ is the dependence sink. For each data dependence involving statements
$S_i(i_1, \ldots, i_k)$ and $S_j(j_1, \ldots, j_k)$ of degree $k$ we define the $r$-th distance $\phi_r$, or $\phi_r(\delta)$, to be
$\phi_r = j_r - i_r$ $(1 \le r \le k)$. The $k$-tuple $(\phi_1, \ldots, \phi_k)$ is called the dependence distance vector. As an
example of dependence calculation, consider the following loop

     DO I = 1, N
s1:     A(I+K) = ...
s2:     ... = A(I)
     ENDO

where K is a nonnegative integer constant. Here $s_1 < s_2$ and $IN(s_2) \cap OUT(s_1) \neq \emptyset$, but we cannot yet
determine whether a flow dependence from $s_1$ to $s_2$ exists. Several factors must be considered in this case.
For a dependence to exist we must have two values of the index $I$, $i_1$ and $i_2$, such that $1 \le i_1 \le i_2 \le N$ and
$i_1 + K = i_2$. To test this we must know the values of K and N. In most programs the value of K is known
at compile-time, but this is not always true for loop bounds like N. If $K < N$ then a dependence may exist.
However, if $K \ge N$ no dependence between the two statements can exist. Frequently loop bounds are not
known at compile-time, but to be on the safe side they are assumed to be "large". In general, dependences
can be computed by solving a Diophantine equation similar to the one above. Algorithms for computing data
dependences are given in [Bane88], [WoBa87].
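
As a hedged illustration of the test described above, the following Python sketch enumerates the index pairs that satisfy i1 + K = i2 for the example loop. The function names (flow_dependence_pairs, may_depend) are hypothetical and do not come from the paper, and brute-force enumeration merely stands in for the Diophantine-equation solvers that the cited algorithms use.

```python
def flow_dependence_pairs(n, k):
    """Return all (i1, i2) with 1 <= i1 <= i2 <= n and i1 + k == i2.

    Each pair means: iteration i1 writes A(i1 + k) and iteration i2 reads it.
    k is assumed to be a nonnegative integer, as in the text.
    """
    return [(i1, i1 + k) for i1 in range(1, n + 1) if i1 + k <= n]


def may_depend(k, n=None):
    """Conservative compile-time answer; an unknown bound is treated as 'large'."""
    if n is None:            # loop bound not known at compile time
        return True
    return k < n


pairs = flow_dependence_pairs(n=10, k=3)
distances = {i2 - i1 for i1, i2 in pairs}   # every pair has distance k
print(pairs[:3], distances)
```

Every pair found has the same distance K, which is why a constant subscript offset yields a constant dependence distance.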

The program data dependence graph, or DDG, is a directed graph $G(V,E)$ with a set of nodes $V$
corresponding to statements in a program, and a set of arcs $E$ representing data dependences between
statements. A DO loop denotes a fixed-iteration loop, which is serial. A loop whose iterations can execute
in parallel and in any order is called DOALL. The dependences in certain loops may allow only partially
overlapped execution of successive iterations. These loops are called DOACROSS and are mentioned in only
a few cases in this paper [Cytr84]. Of course, a loop is marked as being DO, DOALL, or DOACROSS only after
the necessary dependence analysis for that loop has been carried out.

6.2. Automatic Program Parallelisation

After determining dependences and building the DDG, the compiler starts the process of optimising
and restructuring the program from a serial into a parallel form. During the optimisation phase several
architecture-independent optimisations are applied to the source code to make it more efficient, more suitable
for restructuring, or both. The restructuring phase, which actually transforms the program into a vector
and/or parallel form, is also organised into architecture-independent and architecture-dependent subphases.

Loop Vectorisation and Loop Distribution: When compiling for a vector machine, a vectorising compiler
attempts to generate vector instructions out of innermost loops [Kenn80], [PaWo86], [Wolf82]. To do
this the compiler must check all dependences inside the loop. In the simplest case, where no dependences
exist, the compiler can distribute the loop around each statement and create a vector statement for each
case. Vectorising the following loop

     DO I = 1, N
s1:     A(I) = B(I) + C(I)
s2:     D(I) = B(I) * K
     ENDO

would yield the following vector statements

s1:  A(1:N) = B(1:N) + C(1:N)
s2:  D(1:N) = K * B(1:N)

In the original loop one element of A and one element of D were computed at each iteration. In the latter
case, however, all elements of A are computed before computation of D starts. This is the result of distributing
the loop around each statement. In general, loop distribution around two statements $s_i$ and $s_j$ (or
around two blocks $B_i$ and $B_j$) is legal if there is no dependence between $s_i$ and $s_j$ ($B_i$ and $B_j$), or if there
are dependences in only one direction. By definition, vectorisation is only possible on a statement-by-statement
basis. Therefore in a multistatement loop, loop distribution must be applied before vector code
can be generated. As an example, consider the following serial loop.


     DO I = 1, N
s1:     A(I+1) = B(I-1) + C(I)
s2:     B(I) = A(I) * K
s3:     C(I) = B(I) - 1
     ENDO


The data dependence graph is shown on the right. A simple traversal of the dependence graph would
reveal its strongly connected components (SCC). Loop distribution takes place around each SCC. Those
SCCs with single statements (that do not have self-dependences) can be vectorised. The result of vectorising
the above loop would be:

     DO I = 1, N
s1:     A(I+1) = B(I-1) + C(I)
s2:     B(I) = A(I) * K
     ENDO
s3:  C(1:N) = B(1:N) - 1

Due to the dependence cycle the loop cannot be distributed around $s_1$ and $s_2$.
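
The SCC-based decision can be sketched in Python as follows. This is a minimal illustration, not the paper's algorithm: the arc set was derived by hand from the three-statement loop above (numbering its statements s1, s2, s3 in textual order), the helper names are hypothetical, and a simple transitive closure replaces the linear-time SCC algorithms a compiler would normally use.

```python
# Dependence arcs (source, sink) of the three-statement loop:
#   s1 -> s2 : flow on A   (s1 writes A(I+1), s2 reads A(I))
#   s2 -> s1 : flow on B   (s2 writes B(I),   s1 reads B(I-1))
#   s2 -> s3 : flow on B   (s2 writes B(I),   s3 reads B(I))
#   s1 -> s3 : anti on C   (s1 reads C(I),    s3 writes C(I))
NODES = ["s1", "s2", "s3"]
ARCS = {("s1", "s2"), ("s2", "s1"), ("s2", "s3"), ("s1", "s3")}


def reachable(nodes, arcs):
    """Transitive closure by repeated relaxation (adequate for tiny graphs)."""
    reach = {n: {n} for n in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in arcs:
            for n in nodes:
                if a in reach[n] and not reach[b] <= reach[n]:
                    reach[n] |= reach[b]
                    changed = True
    return reach


def strongly_connected_components(nodes, arcs):
    """Two nodes share an SCC when each can reach the other."""
    reach = reachable(nodes, arcs)
    sccs = []
    for n in nodes:
        comp = {m for m in nodes if m in reach[n] and n in reach[m]}
        if comp not in sccs:
            sccs.append(comp)
    return sccs


def vectorisable(sccs, arcs):
    """Single-statement SCCs without a self-dependence."""
    singles = [next(iter(c)) for c in sccs if len(c) == 1]
    return [s for s in singles if (s, s) not in arcs]


sccs = strongly_connected_components(NODES, ARCS)
print(sccs, vectorisable(sccs, ARCS))
```

Under these assumptions the sketch groups s1 and s2 into one component, which stays serial, and reports s3 as the only vectorisable statement, in agreement with the transformed loop shown above.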

Loop Interchange: Loop interchange can be used to exchange the nesting depth of a pair of loops in the
same nest, and can be applied repeatedly to interchange more than two loops in a given nest of loops
[PaWo86], [Wolf82]. As mentioned above, when loops are nested vectorisation is possible only for the
innermost loop. Loop interchange can be used in such cases to bring the vectorisable loop to the innermost
position. For example, the following loop

DO I = 2, N
   DO J = 2, M
      A(I,J) = A(I,J-1) + 1
   ENDO
ENDO

is not vectorisable because the innermost loop (a recurrence) must be executed serially. But the outermost
loop is parallel. By interchanging the two loops and vectorising the innermost we get

DO J = 2, M
   A(2:N,J) = A(2:N,J-1) + 1
ENDO

Loop interchange is not always possible. In general, a DOALL loop can be interchanged with any loop
nested inside it. The converse is not always true. Serial loops, for example, cannot always be interchanged
with the loops they surround. Interchange is illegal, for example, in the following loop.

DO I = 2, N
   DO J = 1, M
      A(I,J) = A(I-1,J+1) + 1
   ENDO
ENDO

In general, loop interchange is impossible when there are dependences between any two statements of the
loop with "<" and ">" directions [Wolf82]. In vectorisation, interchange should be done so that the loop
with the largest number of iterations is brought to the innermost position. This creates vector statements
that operate on long vector operands. For memory-to-memory systems (e.g., CDC Cyber 205)
long vectors are particularly important. If, on the other hand, we compile for a scalar multiprocessor, bringing
the largest loop (in terms of number of iterations) to the outermost position is more desirable, since
that would allow the parallel loop to use more processors.
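
The direction rule can be checked on the two example nests with a small Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, distance vectors are supplied by hand, and run_nest uses the same bounds (I = 2..n, J = 1..m) for both examples so that one routine can execute either loop body in the original or the interchanged order.

```python
def direction(d):
    """Map a dependence distance to its direction symbol."""
    return "<" if d > 0 else (">" if d < 0 else "=")


def interchange_is_legal(distance_vectors):
    """Interchanging a two-loop nest is illegal if some dependence has
    direction vector ("<", ">"): the swap would reverse its order."""
    return all((direction(di), direction(dj)) != ("<", ">")
               for di, dj in distance_vectors)


def run_nest(n, m, oi, oj, interchanged=False):
    """Execute A(I,J) = A(I+oi, J+oj) + 1 for I = 2..n, J = 1..m,
    with I outermost (original) or J outermost (interchanged)."""
    a = [[0] * (m + 2) for _ in range(n + 2)]
    pairs = [(i, j) for i in range(2, n + 1) for j in range(1, m + 1)]
    if interchanged:
        pairs.sort(key=lambda p: (p[1], p[0]))
    for i, j in pairs:
        a[i][j] = a[i + oi][j + oj] + 1
    return a


# A(I,J) = A(I,J-1) + 1   has distance vector (0, 1):  directions (=, <)
# A(I,J) = A(I-1,J+1) + 1 has distance vector (1, -1): directions (<, >)
print(interchange_is_legal([(0, 1)]), interchange_is_legal([(1, -1)]))
```

Running both orders confirms the rule on these inputs: the first nest produces the same array either way, while the second one does not.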


Node Splitting and Statement Reordering: Loop vectorisation and parallelisation are impossible when
the statements in the body of the loop are involved in a dependence cycle. Dependence cycles that involve
only flow dependences are hard to break. There are cases, however, where dependence cycles can be broken,
resulting in total or partial parallelisation of the corresponding loops. One case where cycle breaking is
possible is with dependence cycles that involve flow and antidependences. Consider, for example, the following
loop whose statements are involved in a dependence cycle with a flow and an antidependence.

     DO I = 1, N
s1:     A(I) = B(I) + C(I)
s2:     D(I) = A(I-1) * A(I+1)
     ENDO

Due to the dependence cycle this loop cannot be vectorised. Node splitting can be employed here to eliminate
the cycle by splitting the node (statement) which causes the antidependence. This is done by renaming
variable A(I+1) as TEMP(I) and using its new definition in statement $s_2$. After the cycle is broken,
loop distribution can be used to distribute the loop around each statement. The loop can now be distributed
around $s_2$, but not yet around the remaining statements, since there are dependences in both directions. Statement
reordering can be used here to reorder the statements of the loop (reordering is not always legal). The loop
now satisfies the "one direction dependences" rule, and thus it can be distributed around each statement,
resulting in the three vector statements shown below.

TEMP(1:N) = A(2:N+1)
A(1:N) = B(1:N) + C(1:N)
D(1:N) = A(0:N-1) * TEMP(1:N)
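
A quick way to convince oneself that the three vector statements preserve the semantics of the serial loop is to run both on the same data. The Python sketch below does this with plain lists; the function names and the sample values are illustrative assumptions, and list slices play the role of the vector statements.

```python
def original_loop(a, b, c, n):
    """Serial loop: A(I) = B(I) + C(I); D(I) = A(I-1) * A(I+1)."""
    a, d = list(a), [0] * (n + 2)
    for i in range(1, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * a[i + 1]     # new A(I-1), old A(I+1)
    return a, d


def split_reordered_distributed(a, b, c, n):
    """The three vector statements, written with list slices."""
    a, d = list(a), [0] * (n + 2)
    temp = [0] + a[2:n + 2]                               # TEMP(1:N) = A(2:N+1)
    a[1:n + 1] = [b[i] + c[i] for i in range(1, n + 1)]   # A(1:N) = B + C
    d[1:n + 1] = [a[i - 1] * temp[i] for i in range(1, n + 1)]
    return a, d


n = 6
a0 = [3, 1, 4, 1, 5, 9, 2, 6]     # A(0..N+1)
b0 = [0, 2, 7, 1, 8, 2, 8, 0]     # B(1..N) is used
c0 = [0, 1, 6, 1, 8, 0, 3, 0]     # C(1..N) is used
same = original_loop(a0, b0, c0, n) == split_reordered_distributed(a0, b0, c0, n)
print(same)
```

The copy into TEMP must come first: it captures the old values of A(I+1) before the second statement overwrites them, which is exactly what the antidependence requires.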


Loop Blocking: Loop blocking or strip mining is a transformation that creates doubly nested loops out of
single loops, by organising the computation in the original loop into chunks of equal size [PaWo86]. Loop
blocking can be useful in many cases. It is often used to manage vector registers, caches, or local memories
of small size. Many vector computers, for example, have special vector registers that are used to store
vector operands. Loop blocking can be used to partition vector statements into chunks of size K, where K
is the length of the vector registers. The following loop

DO I = 1, N
   A(I) = B(I) + C(I)
ENDO

will become (after blocking)

DO J = 1, N, K
   DO I = J, MIN(J+K-1, N)
      A(I) = B(I) + C(I)
   ENDO
ENDO
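
The index arithmetic of strip mining is easy to get wrong by one, so a small Python sketch may help. It assumes an inclusive inner upper bound of MIN(J+K-1, N), so that each strip holds at most K elements and no index is visited twice; strip_mine and blocked_add are hypothetical names introduced only for this illustration.

```python
def strip_mine(n, k):
    """Inclusive (first, last) index bounds of each strip of 1..n."""
    return [(j, min(j + k - 1, n)) for j in range(1, n + 1, k)]


def blocked_add(b, c, k):
    """A(I) = B(I) + C(I), computed strip by strip.

    The Python lists hold elements 1..N at positions 0..N-1.
    """
    n = len(b)
    a = [None] * n
    for first, last in strip_mine(n, k):
        for i in range(first, last + 1):       # one strip of at most k elements
            a[i - 1] = b[i - 1] + c[i - 1]
    return a


print(strip_mine(10, 4))
print(blocked_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 2))
```

The last strip is simply shorter when K does not divide N, which is the role of the MIN intrinsic in the blocked loop.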

In the same way blocking can be used to overcome size limitations of caches and local memories. In
parallel-vector machines operations with long vector operands can be partitioned into shorter vectors,
assigning each of the short vectors to a different processor. In this case loop blocking will introduce a
parallel loop as in the following example. Consider the vector statement

A(1:N) = B(1:N) * C(1:N)

In a system with $P$ processors (and if $N \gg P$), this vector operation can be sped up further by blocking
it as follows.

K = TRUNC(N/P)
DOALL I = 1, P
   A((I-1)K+1 : IK) = B((I-1)K+1 : IK) * C((I-1)K+1 : IK)
ENDO
A(PK+1 : N) = B(PK+1 : N) * C(PK+1 : N)

Notice that iteration peeling was implicitly used to eliminate the use of the intrinsic function MIN in the
DOALL statement.
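
The block bounds used by the DOALL and by the peeled statement can be written out as a short Python sketch. The function name processor_blocks is a hypothetical helper, and integer division stands in for TRUNC(N/P).

```python
def processor_blocks(n, p):
    """Index ranges handled by each of the p DOALL iterations, plus the peeled tail."""
    k = n // p                                           # K = TRUNC(N/P)
    blocks = [((i - 1) * k + 1, i * k) for i in range(1, p + 1)]
    tail = (p * k + 1, n) if p * k < n else None         # A(PK+1:N), if non-empty
    return blocks, tail


print(processor_blocks(10, 3))
print(processor_blocks(9, 3))
```

All P blocks have the same length K, so the DOALL needs no MIN; the leftover N - PK elements (fewer than P) are handled by the single peeled statement.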

Cycle Shrinking: Cycle shrinking transforms a serial DO loop into two perfectly nested loops: an outer
serial loop and an inner parallel loop [Poly88]. It is based on the observation that although there is a static flow
dependence $S_1 \,\delta\, S_2$ between two statements $S_1$ and $S_2$ of a loop, there may be instances of $S_1$ and $S_2$ that
are not involved in a dependence (if the dependence distance is greater than one). Cycle shrinking extracts
these dependence-free instances of the statements inside a loop, and creates an inner parallel loop. Consider,
for example, the following loop.

DO I = 1, N
   X(I) = Y(I) + Z(I)
   Y(I+3) = X(I-4) * W(I)
ENDO

Such  a  loop  would  be  treated  as  serial  by  the  existing  compilers.  However,  if  cycle  shrinking  is  applied  the 
same  loop  will  be  transformed  to  the  following  one. 

DO J = 1, N, 3
   DOALL I = J, J+2
      X(I) = Y(I) + Z(I)
      Y(I+3) = X(I-4) * W(I)
   ENDOALL
ENDO

The transformed loop can now be executed $\lambda = 3$ times faster, where $\lambda$ is the cycle reduction factor. The
larger the distance $\lambda$, the greater the speedup. Consider a DO loop with $k$ statements which are involved in
a dependence cycle. If the reduction factor for that cycle is $\lambda$, cycle shrinking results in an improvement
factor of $\lambda \times k$. This is true since not only are the iterations of the DOALL loop created by cycle shrinking
independent, but the statements inside each iteration are also independent. Thus parallel loop execution
can be combined with parallel execution of the statements of each iteration.
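
The claim that groups of three consecutive iterations are dependence-free can be checked by simulation. The Python sketch below is an illustration only: the function name run, the dictionary-based arrays, and the sample values are assumptions, and reversing the order inside each group stands in for the arbitrary order of a DOALL.

```python
def run(n, x0, y0, z, w, lam=3, inner_order=list):
    """Execute the loop in groups of lam iterations.

    inner_order permutes the iterations of each group; with the default
    (ascending order) this is exactly the original serial loop.
    """
    x, y = dict(x0), dict(y0)
    for j in range(1, n + 1, lam):
        group = range(j, min(j + lam - 1, n) + 1)
        for i in inner_order(group):
            x[i] = y[i] + z[i]
            y[i + 3] = x[i - 4] * w[i]
    return x, y


n = 9
x0 = {i: i + 1 for i in range(-3, 1)}     # X(-3..0), read as X(I-4)
y0 = {1: 2, 2: 3, 3: 5}                   # Y(1..3), read before being written
z = {i: i for i in range(1, n + 1)}
w = {i: 2 for i in range(1, n + 1)}

serial = run(n, x0, y0, z, w)
shuffled = run(n, x0, y0, z, w, inner_order=lambda g: list(reversed(g)))
print(serial == shuffled)
```

The two dependences in the loop have distances 4 (on X) and 3 (on Y), so any group of at most three consecutive iterations contains neither a source and its sink, which is why the inner order does not matter.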

Loop Spreading: Few transformations exist for interloop optimisations, and all of them concern
memory-related optimisations (e.g., loop fusion [Wolf82]). Loop spreading extracts parallelism from loops in cascade
[GiPo88b]. In particular, loop spreading is most useful for chains of serial loops with interloop dependences.
The transformation works by pairing off iterations from different adjacent loops and executing
those iterations in parallel. Thus the expected speedup improvement is bounded by the number of serial
loops in a chain. In this section we give an example of a simple case of two serial loops with interloop
dependences. A complete version of loop spreading appears in [GiPo88b].

Let $B_1$ and $B_2$ be two serial loops in sequence and let $\beta_i(j)$ denote the $j$-th iteration of $B_i$. The
transformation will produce a new serial loop such that each iteration of the new loop will execute one
iteration of $B_1$ (say $\beta_1(i)$) and one iteration of $B_2$ (say $\beta_2(i-k)$) in parallel. The problem is to determine
which iterations should be combined in order to maximise the parallelism yielded by loop spreading. This is
equivalent to computing the best value of $k$ in $\beta_2(i-k)$. Consider, for example, the following serial loops
with interloop dependences.

DO I = 1, 10
   X(3I+4) = A(I-1) + 1
   A(I) = Y(-2I+25)
ENDO

DO I = 1, 10
   D(I) = X(4I+2)
   X(I+1) = D(I)**2 + D(I-1)
   E(I) = Y(-2I+23)
ENDO

Notice that loop spreading does not alter the order of execution of the iterations of each loop. Thus,
loop-contained dependences are satisfied in all cases. The transformation must ensure that interloop dependences
are also satisfied, by choosing a value of $k$ such that the pair-wise parallel execution of iterations of
the two loops does not violate any dependence. Or equivalently, if iterations $\beta_1(i)$ and $\beta_2(i-k)$ are to be
executed in parallel we must verify that no flow, anti, or output dependences exist between them for all $i$ such that
$1 \le i \le N$. It is clear that for maximum parallelism we must choose the minimum value of $k$, which in the
case of the above example is $k = 3$. Hence after loop spreading the above loops become

DO I = 1, 10
   COBEGIN
      β1(I);
      IF (I > 3) THEN β2(I-3);
   COEND
ENDO

DO I = 8, 10
   β2(I)
ENDO

Assuming $\beta_1(I)$ and $\beta_2(I)$ take the same time, $\tau$, to execute, the total execution time of the original loops is
$20\tau$, while that of the transformed loops is $13\tau$. This is the best possible overlap for the above example.
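
The minimum shift for this example can be recomputed by brute force. The Python sketch below is a hedged illustration rather than the algorithm of [GiPo88b]: it lists the array elements read and written by each iteration of the two loops and uses the condition that, if iteration i of the first loop runs at step i and iteration j of the second loop runs at step j + k, every conflicting pair needs j + k > i. All function names are hypothetical.

```python
N = 10


def loop1_access(i):
    """X(3I+4) = A(I-1) + 1 ; A(I) = Y(-2I+25)"""
    reads = {("A", i - 1), ("Y", -2 * i + 25)}
    writes = {("X", 3 * i + 4), ("A", i)}
    return reads, writes


def loop2_access(j):
    """D(I) = X(4I+2) ; X(I+1) = D(I)**2 + D(I-1) ; E(I) = Y(-2I+23)"""
    reads = {("X", 4 * j + 2), ("D", j), ("D", j - 1), ("Y", -2 * j + 23)}
    writes = {("D", j), ("X", j + 1), ("E", j)}
    return reads, writes


def conflicts(first, second):
    """True if a flow, anti, or output dependence links the two iterations."""
    r1, w1 = first
    r2, w2 = second
    return bool(w1 & r2 or r1 & w2 or w1 & w2)


def minimum_shift(n=N):
    """Smallest k >= 0 such that j + k > i for every conflicting pair (i, j)."""
    k = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if conflicts(loop1_access(i), loop2_access(j)):
                k = max(k, i - j + 1)
    return k


k = minimum_shift()
print(k, N + k)     # shift, and number of steps of the spread loops
```

For these loops the binding constraint is the flow dependence on X between iteration 10 of the first loop and iteration 8 of the second, and the total of N + k steps matches the execution time quoted above.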

Run-Time Dependence Testing: Most techniques that have been developed to analyse array subscripts
and determine loop dependences solve this problem at compile-time. This is of course desirable because
there is no run-time overhead. An alternative is to determine data dependences dynamically
at run-time. The next transformation (run-time dependence checking or RDC) does precisely this. When
the dependence distances vary between different iterations, cycle shrinking is rather conservative. RDC is a
more suitable technique since it sequentialises only those iterations that are involved in a true dependence.
All remaining iterations can execute in parallel. Consider, for example, the following loop

DO I = 1, N
   A(2I-1) = B(I-1) + 1
   B(2I+1) = A(I+1) * C(I)
ENDO

The distance of the flow dependence from the first to the second statement can take the values 1, 2, 3, ....
RDC has two phases: an implicit and an explicit phase. The implicit phase involves computations performed
by the compiler which are transparent to the user. The explicit phase transforms the loop itself. The basic
idea is to be able to determine at run-time whether a particular iteration of a loop depends on one or more
previous iterations. This requires some recording of relevant information from previous iterations. For a
loop DO I = 1, N and for a dependence $S_j \,\delta\, S_{j+1}$ in that loop, we define the dependence source vector
(or DSV) $R_j$ to be a vector with N elements, where non-zero elements indicate the values of I for which
$S_j$ is a dependence source, and zero elements in $R_j$ correspond to values of I for which $S_j$ is not involved
in a dependence. The elements of the DSV are initialised to zero by the compiler. A single bit-vector V with
subscripts in the range [1...N] is also created and is initialised to zero. Vector V is called the synchronisation
vector.

Following  initialisation,  the  compiler  inserts  in  the  transformed  loop  code  which  records  dependence 
sources  and  synchronises  dependence  sinks  on  their  corresponding  sources.  Even  though  the  compiler 
inserts  dependence-testing  code  in  a  loop,  the  actual  dependence  resolution  occurs  during  the  execution  of 
the  loop.  The  transformed  version  of  the  previous  loop  is  shown  below. 

DOALL I = 1, N
   COBEGIN
      IF (1 <= R1(I+1) < I) WAIT ON V(I+1);
      IF (1 <= R2(I-1) < I) WAIT ON V(I-1);
   COEND
   A(2I-1) = B(I-1) + 1
   B(2I+1) = A(I+1) * C(I)
   CLEAR V(I)
ENDO

Run-time dependence checking belongs to a family of transformations called hybrid (static and dynamic)
transformations. In all cases the principle is the same: if the compiler cannot resolve a particular problem
but can precisely identify it, it can generate code which solves that problem during execution,
when run-time information becomes available.
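
The underlying idea, that only iterations linked by an actual dependence need to be ordered, can be illustrated with a small Python sketch for the example loop. This is a deliberately simplified stand-in and not the DSV and synchronisation-vector mechanism described above: it inspects the subscripts at run time, places each iteration one level after the latest earlier iteration it conflicts with, and then executes level by level. The names accesses, wavefronts, and execute, and the sample data, are assumptions.

```python
def accesses(i):
    """Array elements read and written by iteration i of the example loop."""
    reads = {("B", i - 1), ("A", i + 1), ("C", i)}
    writes = {("A", 2 * i - 1), ("B", 2 * i + 1)}
    return reads, writes


def wavefronts(n):
    """Group iterations 1..n into levels of mutually independent iterations."""
    level = {}
    for i in range(1, n + 1):
        ri, wi = accesses(i)
        deps = [j for j in range(1, i)
                if accesses(j)[1] & (ri | wi) or accesses(j)[0] & wi]
        level[i] = 1 + max((level[j] for j in deps), default=0)
    fronts = {}
    for i, lev in level.items():
        fronts.setdefault(lev, []).append(i)
    return [fronts[lev] for lev in sorted(fronts)]


def execute(schedule, a0, b0, c):
    """Run the loop body for the iterations in the given order."""
    a, b = dict(a0), dict(b0)
    for i in schedule:
        a[2 * i - 1] = b[i - 1] + 1
        b[2 * i + 1] = a[i + 1] * c[i]
    return a, b


n = 8
a0 = {k: k for k in range(2 * n + 2)}
b0 = {k: 2 * k for k in range(2 * n + 2)}
c = {k: 3 for k in range(1, n + 1)}

fronts = wavefronts(n)
serial = execute(range(1, n + 1), a0, b0, c)
relaxed = execute([i for f in fronts for i in reversed(f)], a0, b0, c)
print(fronts, serial == relaxed)
```

Iterations inside one level never conflict, so they may run in any order (here, reversed) without changing the result; only the ordering between levels is enforced.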

7. Partitioning, Synchronisation, and Scheduling

In Sections 5 and 6 we reviewed manual parallel programming through the use of parallel languages,
and automatic program restructuring via parallelising compilers. However, the major advantage of parallelising
compilers is not only automatic parallelisation of serial programs, but their potential to automate
the process of program partitioning, synchronisation, and scheduling (which are by far more complex than
parallelism detection and often ad hoc).




7.1.  Partitioning  and  Task  Formation 

Program partitioning is one of the most important phases of parallel programming. It is closely
linked to scheduling and, even though it is discussed separately here, partitioning should be considered in
conjunction with scheduling. Informally, the partitioning phase decides which modules of a program can
execute in parallel and how. A parallel program can be represented by a directed graph called the task
graph, where nodes represent program modules and arcs represent ordering and dependence relations
between nodes. The nodes of a program graph are called tasks. A task can consist of a single or multiple
user processes or u-processes. U-processes are units of code with their own instruction stream and private
memory space. Thus tasks are static entities which simply correspond to specific modules of a program.
On the other hand, u-processes are generated from tasks upon dispatching, and in addition to their code
they own part of the virtual address space.

So far partitioning has been treated mostly from a theoretical point of view, as a problem which is
often decoupled from scheduling [Bokh88]. Most of the existing partitioning algorithms assume an idealised
representation of a program, which usually takes the form of a directed graph as the initial representation.
This graph can represent a user-defined partition or a compiler representation of the program graph
(e.g., control flow or data dependence graph, or a higher-level program representation). Nodes represent
tasks and arcs represent communication channels between tasks. Such graphs are evaluated under the
assumption that each node may execute on a different processor at any time. Assumptions such as known
execution times of nodes and communication weights among nodes are then used to derive a more efficient
task graph. This can be done by merging nodes together to eliminate heavy communication links, or by
splitting large nodes into smaller ones to increase the degree of parallelism in the graph. The end result is
a more efficient representation of the program graph or even an explicit assignment of groups of nodes to
specific processors. Even though the above models are far from being representative of real programs, the
algorithms which have been developed can be used for approximate solutions and have contributed to the
analysis and understanding of many important aspects of the problem.

On a more practical basis the partitioning problem can be considered from two different angles: the
data and the instruction stream viewpoint. In the first case, partitioning is based on the decomposition of
data objects upon which computation is performed. Each processor is assigned the work corresponding to a
specific data domain. This form of partitioning is often called data partitioning or horizontal partitioning.
Data partitioning is feasible when the same type of computation is performed on all data domains. Typical
computations of this type include loops and other repetitive computations, i.e., the same u-process is
executed for each data domain.

The second type of partitioning, called functional or vertical partitioning, results in the formation of
tasks from syntactically identifiable pieces of code. Thus different partitions operate on different data
objects, or on the same data object but in some specific order. For example, forming two tasks out of two
disjoint outer loops or two different subroutine calls is a case of functional partitioning. Another common
term for functional partitioning is high-level spreading. Partitioning must be done such that the following
goals are met.

•  The tasks formed by partitioning a program should be as independent as possible, i.e.,
sharing of data objects between tasks should be minimal. This implies that data
objects should be decomposed such that different components correspond to different
tasks. Notice that both data and functional partitioning conform to this requirement.

•  Tasks should be of approximately equal size. As will be shown later, this helps in
balancing the load across processors by using simple and fast scheduling heuristics. Data
partitioning tends to satisfy this requirement while functional partitioning does not.

•  The size of the tasks formed should be a function of the overhead incurred during task
scheduling and the synchronisation overhead. In other words, tasks should be large
enough compared to the overhead involved in order to achieve any speedup. Roughly
speaking, this means that the total overhead associated with the parallel execution of a
task should be less than its serial execution time.

•  A balance must be achieved between communication and scheduling overhead, and
degree of parallelism in a task graph. These two objectives conflict, since minimising
overhead tends to merge all tasks into fewer large tasks, while increasing the degree of
parallelism (i.e., the number of independent tasks in the graph) tends to decompose large
tasks into their smallest constituents. The degree of parallelism should be considered in
conjunction with the number of available processors.

•  Finally, partitioning should be based on realistic assumptions about the program. For
example, the type of a task is readily available to the compiler but the execution time of a
task is not. If exact algorithms are used for partitioning, e.g. critical path, they should be
adapted to compensate for inaccuracies.

Even  though  we  will  return  to  the  partitioning  issue  later  in  this  paper,  for  now  we  assume  that  this  phase 
provides  the  following  phases  of  parallel  programming  with  a  well-defined  decomposition  of  a  program 
into  a  set  of  tasks. 
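
The trade-off between task size and overhead can be made concrete with a toy cost model in Python. The model is an assumption made only for illustration and does not come from the paper: the work of a task is taken to divide perfectly over the processors, and a fixed overhead (scheduling plus synchronisation) is added once. The function names are hypothetical.

```python
def parallel_time(serial_time, processors, overhead):
    """Toy model: perfectly divisible work plus a fixed overhead."""
    return serial_time / processors + overhead


def worth_parallelising(serial_time, processors, overhead):
    """True if the modelled parallel execution beats serial execution."""
    return parallel_time(serial_time, processors, overhead) < serial_time


# A large task absorbs the overhead; a small one does not.
print(worth_parallelising(serial_time=100.0, processors=4, overhead=10.0))
print(worth_parallelising(serial_time=10.0, processors=4, overhead=10.0))
```

Under this model a task whose serial time does not exceed the overhead can never be sped up, whatever the number of processors, which is the rough rule stated in the list of goals.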

7.2.  Synchronisation  and  Communication 

Depending on the architecture of the target machine, explicit synchronisation or communication
instructions must be inserted in a parallel program to ensure correct parallel execution. Again we look at
the problem from the compiler point of view. Synchronisation may or may not be needed depending on the
order of task execution. For example, if different tasks involved in a data or control dependence are
allowed to execute in parallel, then explicit synchronisation instructions must be inserted. Another type of
synchronisation is needed if tasks are restricted to execute in parallel only if they are independent; in that
case a single (barrier) synchronisation point between two tasks will suffice. Similarly, in the distributed
memory case, appropriate messages should be explicitly routed between different tasks. However, communication
within a single processor (task) can take the form of synchronisation through the local memory (a
much less expensive operation).

Synchronisation and/or communication introduce overhead during parallel execution. As mentioned
earlier, overhead estimates are used in deciding how a program can be partitioned into tasks. Thus, prior
to partitioning, it must be known where synchronisation and communication are needed. This information
need not be in the form of the corresponding instructions; cost and some qualitative information are
sufficient for determining overheads. After partitioning has been specified, only those synchronisation
instructions needed to enforce the dependences expressed in the task graph need to be inserted in the program.

Alternatively, synchronisation and communication instructions can be inserted in a program wherever
applicable. After partitioning and possibly scheduling, an optimisation phase must follow to eliminate
redundant synchronisation and communication instructions. Approaches to the latter optimisation problem
are reported in [MiPa88]. It is also clear that the static or dynamic nature of partitioning and scheduling
directly affects the generation and optimisation of synchronisation instructions. For example, if a task is
allowed to disintegrate dynamically during execution, static synchronisation (which must consider the
worst case) may result in superfluous synchronisation at run-time.

7.3. Scheduling and Overhead Analysis

Scheduling is one of the most performance-sensitive phases of parallel programming. Deciding what
task or process executes on what processor and in what order is a nontrivial problem. Again, scheduling
should be done such that program finish time is minimised. Partitioning is a form of incomplete
scheduling, since it specifies an "abstract" processor allocation.

Although the end result of scheduling (which is minimum execution time through load balancing and
low overhead) is the same for both shared and distributed memory systems, the approaches differ for each
model. In shared memory parallel processors all tasks and data objects of a program are equally accessible
to all processors. This simplicity allows more flexibility in scheduling tasks. In distributed memory systems
one of the early differences is the need to down-load different tasks on different processors. Adjacency in
the interconnection structure strongly affects how efficiently scheduling can be done. For example,
if a complete interconnection is used for a distributed memory system, a task graph can be scheduled
in exactly the same way as if it were to be executed on a shared memory system, by simply translating
synchronisation instructions to equivalent message passing instructions. However, if the interconnection is
more sparse (e.g., hypercubes), different issues arise. Besides the extra overhead of down-loading, one needs
to face other problems such as process migration and nonuniform communication costs, which further
complicate the overall optimisation problem.

Unless a precise static allocation is achieved, process migration should be allowed to balance the load
across the entire system. Because exact program information is unavailable at compile-time, a large
number of scheduling algorithms for distributed systems are based on the process migration principle,
allowing quick initial allocations which balance themselves by sending tasks to and receiving tasks from
neighboring processors. This activity introduces significant space and time costs, since its implementation
requires each processor to maintain and update load tables for its adjacent processors. If
system-wide migration is allowed, then in a p-processor machine each processor needs to maintain load
information for the other p-1 processors. Moreover, since migration outdates this information, periodic
system-wide updates must take place by having processors exchange load information among themselves.
Even though this is the best approach for achieving the most evenly balanced load, it incurs a
tremendous overhead. A middle-ground solution is to define neighborhoods of processors and allow process
migration only within each neighborhood. Load tables are then maintained and updated independently within
each group, at the expense of load balancing. How these neighborhoods are defined depends heavily on the
interconnection structure and a number of other parameters. Down-loading, communication, and process
migration have a profound effect on task granularity. Because all these extra scheduling activities are
orders of magnitude more expensive than simple synchronisation, tasks need to be much larger in order to
effectively amortise the overhead and achieve speedups over serial execution. Substantially increasing the
granularity of tasks decreases the degree of parallelism of a program. Therefore, it also restricts
the number of applications which can benefit from parallel execution on such systems.

In contrast, none of the above activities, besides synchronisation, occur during scheduling in shared
memory parallel machines. Migration is not necessary since, owing to the sharing of common memory and in
terms of cost, such a system can be thought of as completely connected: the communication cost between
any pair of processors is constant. Moreover, down-loading is not necessary (except in the case of fully
static scheduling). Tasks can remain in shared memory and be dispatched by processors as needed. This
greatly simplifies the scheduling problem for such architectures, but even in this simpler form, scheduling
remains a crucial and far from trivial problem to solve.

Thus, if one were to judge conformable (comparable) shared versus distributed
memory parallel architectures on the scheduling issue alone, the superiority of the former
architecture model is clear. Of course, if very "dense" interconnection structures are employed for
distributed systems this advantage fades. Nevertheless, one can argue that in such a case the cost of
building multiport nodes may become prohibitive, and we soon converge to an architecture which can be more
effectively realised as a shared memory.

The  remaining  discussion  on  scheduling  focuses  on  shared  memory  architectures  even  though  many 
of  the  ideas  discussed  can  also  be  applied  to  distributed  memory  systems.  There  are  three  fundamental 
approaches  to  scheduling:  static,  dynamic,  and  hybrid.  Each  approach  depends  on  how  much  information 
about  a  program  is  available  to  the  scheduler,  as  well  as  on  the  phase  this  information  becomes  available. 
We  consider  below  each  approach  separately. 

•  Static scheduling: As implied by the term, static scheduling can be performed either by the programmer
or by the compiler before program execution. Parallelism is exploited by spreading different computations
over different processors statically. Thus, the programmer knows exactly what parts of a program will
execute on each processor. For static scheduling to be effective one needs to have detailed knowledge about a
program (such as the outcome of conditionals, the size of loops, etc.) available in advance. Knowledge
about the architecture of the target machine (e.g., number of physical processors, timing of I/O and
functional units) is also necessary. Such details are not usually available, but coarse estimates can be supplied
by the programmer or the compiler.

Table 2. Characteristics of scheduling strategies.

The main advantage of static scheduling is its low run-time overhead; no scheduling activities occur
during program execution. It is also easier to trace the execution of a statically scheduled parallel program,
and thus debugging becomes less of a problem. Overhead is paid, however, during compilation or in terms
of programming time. Static scheduling algorithms usually involve polynomial or even exponential
complexity. Therefore, more time is spent at compilation. The major drawback is the generally unrealistic
assumptions upon which static scheduling is based. The result is typically an unbalanced load. Since
parallel execution time is defined by the last processor to finish, an unbalanced load results in longer parallel
execution times and low machine utilisation. In general, pure static scheduling is not a suitable approach
for general-purpose parallel machines. It can be effectively used for specialised architectures such as
systolic arrays or VLIW-based machines.
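
The effect of an unbalanced static allocation can be illustrated with a small sketch. It assumes a hypothetical loop whose iteration cost grows with the loop index, and a compile-time allocation that gives each processor one contiguous block; the function names and the cost model are illustrative assumptions, not part of any system described here.

```python
def static_block_schedule(iteration_times, p):
    """Assign contiguous blocks of iterations to p processors before execution."""
    n = len(iteration_times)
    block = -(-n // p)  # ceil(n / p) iterations per processor
    loads = []
    for proc in range(p):
        start = proc * block
        loads.append(sum(iteration_times[start:start + block]))
    return loads


def parallel_time(loads):
    # Parallel execution time is set by the last processor to finish.
    return max(loads)


def utilisation(loads):
    # Fraction of the total processor time that is spent on useful work.
    return sum(loads) / (len(loads) * max(loads))


# Hypothetical loop: 16 iterations costing 1, 2, ..., 16 time units.
times = list(range(1, 17))
loads = static_block_schedule(times, 4)
print("per-processor load:", loads)
print("parallel time:", parallel_time(loads))
print("utilisation:", round(utilisation(loads), 3))
```

Under these assumptions the last processor receives the most expensive iterations, so it alone determines the finish time while the other processors sit idle for much of the run.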

•  Dynamic scheduling: Dynamic scheduling is complementary to static in both its advantages and
disadvantages. Dynamic scheduling is implemented at run-time through the operating system, the compiler, the
hardware, or a combination thereof. Scheduling is based on simple, typically constant-time heuristics
which work satisfactorily in most cases. Knowledge about program characteristics is not needed; at best
some qualitative knowledge can be useful. The major drawback of dynamic scheduling is the run-time
overhead that it incurs. Since decisions are taken during program execution, scheduling activities consume
processor cycles. The more sophisticated the scheduling heuristics, the higher their complexity and thus the
higher the overhead. On the other hand, dynamic scheduling achieves the highest degree of load balancing
under the same initial conditions.

In principle, dynamic scheduling can be used at all levels of task granularity starting from the
instruction  level  up  to  the  program  level.  However,  the  suitability  of  a  particular  scheme  depends  on  many 
factors  such  as  the  complexity  and  the  overhead  in  both  time  and  hardware.  For  general  purpose  parallel 
processors  dynamic  scheduling  is  most  appropriate  for  large-to-medium  granularity  tasks. 

•  Hybrid scheduling: Since static and dynamic are complementary approaches with respect to balancing
and overhead, a combination of the two may result in more efficient implementations of scheduling on
average. Few hybrid schemes have been designed or implemented on real machines. Even though they
involve more complexity from the design and implementation perspectives, they do offer the most
attractive alternative.

During hybrid scheduling, some of the scheduling takes place at compile-time and some at run-time.
Qualitative information about programs is usually enough for hybrid schemes to work efficiently. We
overview such an approach in Section 8. Static partitioning with dynamic task scheduling is a type of
hybrid scheduling, since partitioning directly affects scheduling as indicated earlier. Table 6 summarizes
the various approaches to parallel program scheduling and their characteristics with respect to complexity,
overhead, load balancing, and input information needed about the subject program.

7.3.1.  Loop  Scheduling,  Microtasking,  and  Macrotasking

Cray Research was the first supercomputer vendor to introduce software which supported the
specification and scheduling of parallel constructs on the Cray X-MP series. This software, known as
multitasking, provided only the environment for creating and executing tasks of the same program on different
processors, but it left the responsibility for defining, scheduling, and synchronising tasks to the
programmer. The multitasking environment was based on a collection of macros which allowed the user to create
processes (much like Unix processes) through calls to the operating system. These invocations of the
operating system made multitasking a very expensive means of exploiting program parallelism. In
addition, tasks needed to be organised as subroutines, with the extra overhead of a subroutine call paid for every
instantiation of a new task. The overhead associated with multitasking made it useful only in cases where
large subroutines could execute simultaneously. Many other parallel computer vendors followed with
different implementations of multitasking.

Later versions of multitasking software avoided excessive overhead and made parallelism exploitation
at the loop level possible. Multitasking at the loop level is more commonly known as microtasking.
Microtasking works much the same way as multitasking, with the main difference being that operating system
involvement is kept minimal. Instead of invoking the operating system to create a process and allocate
space for it every time a new task needs to be initiated, microtasking creates a number of processes at the
beginning of program execution. These processes, which we call here system processes or s-processes,
remain live throughout the execution of a program. When a u-process (task) needs to be scheduled for
execution, an s-process is fetched from a queue of s-processes and is bound to that particular u-process.
This is when an s-process receives context, and it is then called an s-process with context or s-c-process. Upon
completion of the execution of an s-c-process the s-process is not destroyed (as was the case with early
versions of multitasking); rather, it returns as an empty s-process to the above queue, and it can be used
later to execute another u-process from the same program. Thus s-processes function as vehicles which
carry u-processes through execution in a physical processor.
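
The reuse of s-processes can be sketched with a fixed pool of worker threads that is created once and then bound repeatedly to incoming tasks. This is only an illustration under stated assumptions: the class `SProcessPool` and its methods are hypothetical names, threads stand in for system processes, and for simplicity the idle workers pull u-processes from a task queue rather than being fetched from a queue of s-processes.

```python
import queue
import threading


class SProcessPool:
    """A fixed set of worker threads ("s-processes") created once and reused."""

    def __init__(self, num_s_processes):
        self._tasks = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_s_processes)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            u_process = self._tasks.get()   # binding: the s-process receives context
            if u_process is None:           # shutdown marker
                self._tasks.task_done()
                return
            try:
                u_process()                 # carry the u-process through execution
            finally:
                self._tasks.task_done()     # the s-process is empty again, not destroyed

    def submit(self, u_process):
        self._tasks.put(u_process)

    def wait(self):
        self._tasks.join()

    def shutdown(self):
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


results = []
results_lock = threading.Lock()


def make_task(i):
    def task():
        with results_lock:
            results.append(i * i)
    return task


pool = SProcessPool(4)          # four s-processes serve all ten u-processes
for i in range(10):
    pool.submit(make_task(i))
pool.wait()
pool.shutdown()
print(sorted(results))
```

No thread is created or destroyed per task; the only per-task cost is one queue operation, which is the saving that microtasking aims for.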

Microtasking avoids unnecessary overhead by reusing s-processes within each program. In more
efficient implementations of microtasking, s-processes are created not just once per program, but once when
the system is cold-started, and they are bound to u-processes from possibly different user programs at the
same time. Thus the overhead is further reduced. The number of s-processes so created is a system
parameter and it depends, among other things, on the number of physical processors and the memory space available.
Microtasking is commonly used for the execution of parallel loops by many processors. In most cases each
different loop iteration is bound to an s-process.

Microtasking as described above is a suitable solution for loop scheduling (data partitioning) but not
necessarily for high-level spreading (functional partitioning) where the definition of a task is arbitrary.
During the execution of parallel loops all s-processes inherit the same calling environment, but this is not
always the case with tasks generated from high-level spreading. To support parallelism at this level,
multitasking is still used under the distinguishing term of macrotasking. Macrotasking is a more finely tuned
implementation of multitasking but it is based on the same framework.

It is important, however, to realize that microtasking and macrotasking provide the environment
which supports parallel programming at the user level, but they do not offer solutions as to how tasks
should be organised (partitioning), synchronized, and scheduled. This is still the programmer's or the
compiler's responsibility. In Section 8 we discuss a more general environment which, in addition to
providing capabilities equivalent to micro- and macrotasking, automates the process of partitioning and
scheduling in a unique way.

Loop  Scheduling  (Process  Level) 

Loops in numerical programs can be fairly complex with conditional statements and subroutine calls.
A general solution should distribute iterations to processors at run-time based on the availability of
processors and other factors. However, the overhead associated with run-time distribution must be kept very
low for dynamic scheduling to be practical. We consider here three possible schemes for loop scheduling.
The first two are well-known and one or the other is used by most modern parallel machines; the third is a
more efficient approach. The three schemes differ in the number of loop iterations that they assign to each
idle processor, and thus in the balancing of load and the total execution time.

One Iteration at a Time (Self-Scheduling): This scheduling scheme is commonly referred to as
self-scheduling. An idle processor picks a single iteration of a parallel loop by exclusively incrementing the
loop indices [Smit81], [TaYe86]. Thus if N is the total number of iterations of a loop, self-scheduling
involves N dispatch operations. Roughly speaking, if p processors are involved in the execution of the
loop, then each processor gets N/p iterations. Let B be the average iteration execution time and σ the
overhead involved with each dispatch. Then self-scheduling is appropriate if B >> σ and there is a large
variation in the execution time of different iterations. Because self-scheduling assigns one iteration at a
time, it is the best dynamic scheme as far as load balancing is concerned. However, a perfectly balanced
load is meaningless if the overhead used to achieve it exceeds a certain threshold. Overall, self-scheduling
may be appropriate for loops with a relatively small number of iterations which have variable execution
times, and only if σ is small compared to B.
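
As an illustration of the dispatch protocol, the following sketch lets p threads claim one iteration at a time by incrementing a shared loop index under a lock. The function name `self_schedule` and the use of Python threads are assumptions made for this example; the threads show the protocol only and are not meant to demonstrate real speedup.

```python
import threading


def self_schedule(n_iterations, p, body):
    """Run body(i) for every i in range(n_iterations), one iteration per dispatch."""
    next_index = 0
    lock = threading.Lock()
    dispatches = [0] * p            # dispatch operations performed by each worker

    def worker(pid):
        nonlocal next_index
        while True:
            with lock:              # exclusive increment of the shared loop index
                i = next_index
                next_index += 1
            if i >= n_iterations:
                return              # loop exhausted
            dispatches[pid] += 1
            body(i)

    threads = [threading.Thread(target=worker, args=(pid,)) for pid in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dispatches


done = [False] * 100


def mark(i):
    done[i] = True


counts = self_schedule(100, 4, mark)
print("dispatches per worker:", counts, "total:", sum(counts))
```

Every iteration costs one pass through the critical section, so the total number of dispatches always equals N regardless of how the work happens to be divided among the workers.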

Chunk-Scheduling: Chunk-scheduling is in principle the same as self-scheduling. But in this case, a
fixed number of iterations (chunk) is allocated to each idle processor (as opposed to a single iteration at a
time). By doing so, one can reduce the overhead by compromising load balancing. This is clear since the
unit of allocation is of higher granularity now, and thus the potential variation of finish time among the
processors is also higher. There is a clear tradeoff between load balancing and overhead. At one extreme,
the chunk size is roughly N/p and each processor performs only one dispatch per loop. The variation of
finish time is also the highest in this case. At the other extreme, the chunk size is one and we have
self-scheduling with perfect load balancing and maximum overhead. Intermediate values of the chunk size in
the range [1, ..., ⌈N/p⌉] will produce results that are better or worse than either of the extreme cases. The
main drawback of chunk-scheduling is the dependence of chunk size on the characteristics of each loop,
which are unknown even at run-time. Worse yet, even for the same loop, the execution time does not vary
monotonically with monotonically increasing or decreasing chunk size. This makes the derivation of an optimal
chunk size practically impossible even on a loop-by-loop basis.
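
The tradeoff between the two extremes can be made concrete by listing the iteration ranges that are handed out for a given chunk size. The helper below is a hypothetical sketch (the name `chunk_schedule` and the 1-based inclusive ranges are choices made for this example); it only enumerates dispatches and does not model execution times.

```python
def chunk_schedule(n, k):
    """Return the (first, last) iteration ranges, 1-based and inclusive, for chunk size k."""
    return [(first, min(first + k - 1, n)) for first in range(1, n + 1, k)]


n, p = 100, 4
for k in (1, 5, 25):
    ranges = chunk_schedule(n, k)
    # One dispatch per range: k = 1 is self-scheduling, k = n/p is one dispatch per processor.
    print("chunk size", k, "->", len(ranges), "dispatches")

print(chunk_schedule(10, 4))
```

With chunk size one there are N dispatches in total; with chunk size N/p there are only p, but nothing remains to even out differences in finish time.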

Guided Self-Scheduling: Self-scheduling achieves a perfect load balancing but it also incurs maximum
overhead. On the other hand, chunk-scheduling is an (unsuccessful) attempt to reach a compromise
between load balancing and overhead, and the result may be quite unpredictable. The third scheme, guided
self-scheduling (or GSS) [PoKu87], is, in general, a much better and more stable approach to reaching this
compromise. The idea is to start the execution of a loop by allocating chunks of iterations whose size starts
from ⌈N/p⌉ and keeps decreasing until all the iterations are exhausted. The last p - 1 chunks of iterations
are of size one. Thus, chunk sizes vary between the two extremes. Figure 2 gives the GSS algorithm.

The advantages of GSS are many. First, the property of decreasing chunk size is built-in and no
extra computation is required to enforce this policy. This simplicity allows for easy and efficient
implementation. Secondly, the two main objectives of perfectly balanced load and small overhead are achieved
simultaneously. By allocating large chunks at the beginning of the loop we keep the frequency of dispatching,
and thus the overhead, low. At the same time, the small chunks at the end of the loop serve to "patch
holes" and balance the load across all processors. For some ideal cases, GSS is provably optimal. This
cannot be said for either self- or chunk-scheduling. GSS has been implemented in Cray's autotasking
library as well as in DEC's Fortran 5.0 compiler. For a parallel loop with N iterations the average
number of dispatch operations per processor for self, chunk, and guided self-scheduling is N/p, N/(kp)
(for chunk size k), and log_e(N/p) respectively.

Guided Self-Scheduling

Input:   A parallel loop L with N iterations, and p processors.

Output:  The optimal dynamic schedule of L on the p processors. The schedule
         is reproducible if the execution time of the loop bodies and the
         initial processor configuration (of the p processors) are known.

•  If R_i is the number of remaining iterations at step i, then set R_1 = N, i = 1,
   and for each idle processor do:

REPEAT
   •  Each idle processor (scheduled at step i) receives x_i = ⌈R_i / p⌉ iterations.
   •  R_{i+1} = R_i - x_i
   •  The range of the loop index is I ∈ [N - R_i + 1, ..., N - R_i + x_i]
   •  i = i + 1
UNTIL (R_i = 0)


Figure 2. The GSS algorithm.
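
A minimal sketch of the chunk computation follows, assuming that each dispatch receives the ceiling of the remaining iterations divided by p, so that the first chunk has size ⌈N/p⌉ as stated in the text. The function name `gss_chunks` is hypothetical, and the sketch only enumerates the successive chunks; it does not model which processor receives each one.

```python
def gss_chunks(n, p):
    """Return the successive (first, last) iteration ranges, 1-based and inclusive."""
    remaining = n                       # R_i: iterations not yet dispatched
    chunks = []
    while remaining > 0:
        size = -(-remaining // p)       # x_i = ceil(R_i / p)
        first = n - remaining + 1       # N - R_i + 1
        chunks.append((first, first + size - 1))
        remaining -= size               # R_{i+1} = R_i - x_i
    return chunks


chunks = gss_chunks(100, 4)
sizes = [last - first + 1 for first, last in chunks]
print("chunk sizes:", sizes)
print("dispatches:", len(chunks))
```

For N = 100 and p = 4 the sizes fall from 25 down to 1 in 14 dispatches, compared with 100 dispatches for self-scheduling on the same loop.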

Figure 3. The structure of the Parafrase-2 multilingual compiler.

Figure 4. The major modules of an auto-scheduling compiler.


8.  A  Performance-Oriented  Environment  for  Parallel  Programming

In this section we discuss our approach to a fully automated environment for automatic, efficient,
and effective parallelism specification and exploitation. Having considered partitioning, synchronisation,
and scheduling in a general context, it should be easy to justify our approach to an integrated
programming environment in the Parafrase-2 project.

The organisation of Parafrase-2, the first multilingual parallelising compiler to be built, is shown in
Figure 3. This ongoing project at the University of Illinois' CSRD aims to develop a single parallelising
compiler for many popular programming languages [PGHL89]. Even though at present the system
compiles C and several Fortran dialects, provisions have been made to add Pascal and other languages in the
future. Adding a new language involves a preprocessor for that language, which translates a source
program into a common intermediate representation. The major emphasis of the Parafrase-2 project is the
second phase of parallel programming, namely that of parallelism exploitation. In the rest of this section
we describe our approach to this latter part of the compiler.




There are advantages and disadvantages to each scheduling approach. With the growing variety and
complexity of parallel architectures, monolithic approaches to program scheduling become obsolete. Both
compilers and run-time systems need to cooperate in order to achieve desirable results. Pure static schemes
are too unrealistic to be practical. Similarly, pure dynamic schemes that ignore useful program
information are bound to fail badly in certain cases [Poly88]. An ideal scheduler should use information
about a program to its advantage, but it should also operate in the "obvious and least expensive" mode whenever
information is inadequate. Our approach to the parallelism packaging and scheduling problem is a blend of
compiler and run-time schemes. Figure 4 shows the different components of our framework as parts of
what we call an auto-scheduling compiler [Poly88].

Partitioning and qualification of parallelism: This phase is responsible for partitioning the code (and/or
data structures) of a program into identifiable modules which are treated as units when it comes to firing.
For example, compiling vector or VLIW instructions, or grouping a set of instructions into a unit is part of
partitioning. Program partitioning can be done statically by the compiler. In a fully dynamic environment
(e.g., dataflow) partitioning is implicit and, depending on the execution model, an allocatable unit can
range from a single instruction to a set of instructions. Static (explicit) partitioning is desirable because it
exploits readily available information about program parallelism and can consider overhead and other
performance factors. In our case we use a semi-static partitioner: the formation of tasks is based on the
syntax of the language and other program information, but a task is allowed to be decomposed into subtasks
dynamically during program execution. The result of the partitioner is a program task graph with nodes
corresponding to tasks (instruction modules) and arcs representing data and control dependences. Another
activity of the partitioner is to perform overhead analysis and determine where tasks need to be split or
merged in order to effectively amortise the overhead that arises from dispatching and synchronising the
components of such tasks.

Figure 5. (a) A task graph, (b) The graph with entry and exit blocks.

Pre-scheduling: After a program is partitioned into a set of identifiable tasks, the compiler can enforce
certain scheduling restrictions based on its knowledge about the program. In the presence of dynamic
scheduling, pre-scheduling is necessary to eliminate race conditions which may result in performance degradation.
Such conditions arise due to the implementation of task queues and their access protocols. In the case of a
single task queue and under the assumption that parallel tasks remain queued until all processes are spawned
(which is often the case in real implementations), race conditions arising during queueing operations may
create "artificial" performance bottlenecks. Consider for example the case where P idle processors start
executing a program consisting of a large serial task(s) and a parallel task(p). If task(p) is queued in
front of task(s), and assuming that task(p) is large enough to spread across all P processors, then task(s)
will execute only after the parallel task has completed, and thus P-1 processors will remain idle for
the duration of task(s). Such scenarios can arise in more complex task graphs as well.
By giving priority to serial tasks and delaying parallel tasks for as long as all processors are busy, such
performance hazards can be eliminated.
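
The cost of queueing the parallel task first can be estimated with a simple model. The sketch below assumes, purely for illustration, that the parallel task is perfectly divisible among processors and that dispatching is free; the function names and the sample figures are hypothetical.

```python
def finish_parallel_first(t_serial, w_parallel, p):
    """task(p) occupies all p processors, then task(s) runs alone on one of them."""
    return w_parallel / p + t_serial


def finish_serial_first(t_serial, w_parallel, p):
    """One processor takes task(s) at once; the rest start task(p) and it joins later."""
    # Either the serial task is the critical path, or all work is shared evenly.
    return max(t_serial, (t_serial + w_parallel) / p)


def idle_time_parallel_first(t_serial, p):
    # p - 1 processors wait for the whole duration of the serial task.
    return (p - 1) * t_serial


t_s, w_p, procs = 10.0, 40.0, 4
print("parallel task first:", finish_parallel_first(t_s, w_p, procs))
print("serial task first:  ", finish_serial_first(t_s, w_p, procs))
print("idle processor time:", idle_time_parallel_first(t_s, procs))
```

Under these assumptions, giving priority to the serial task shortens the finish time whenever more than one processor is available, because the serial work overlaps with the parallel work instead of following it.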

Dynamic task scheduling (nested parallel tasks): Scheduling at the task level is then performed dynamically
during  execution.  Traditionally,  dynamic  task  scheduling  has  been  supported  through  the  operating  system 
directly,  or  indirectly  through  the  run-time  system  (e.g.,  earlier  versions  of  Cray  multitasking).  The  major 
drawback  to  this  approach  is  the  enormous  overhead  involved  in  OS  invocations. 

The auto-scheduling approach proposed in [Poly88] is based on the idea that the program itself is
responsible for packaging and managing its parallelism, and thus the processors allocated to it. Under an
auto-scheduling compiler (ASC) environment, the compiler, which has access to vital program information,
generates drive-code for each task in the task graph representation of a program. Each task is instrumented
with an entry-block (ENT-BLK) and an exit-block (EXT-BLK). An example of a program task graph is shown in Figure 5a.
After drive-code generation the same graph is instrumented as in Figure 5b. This drive-code is responsible
for recording precedence relations, enforcing synchronisation, queueing and dequeueing tasks, and
decomposing parallel tasks into their constituent processes.

Tasks are treated as units of execution. Tasks which are ready to execute are queued in a ready-task
queue. The ready-queue is also a data structure created, owned, and manipulated by each user program.
Each idle processor tries to dispatch the next available task from the queue (if any). Also, tasks are queued
and thus are qualified for execution as soon as they become "ready". It is important to note that parallel
tasks are dynamically decomposed into a number of smaller processes whose size and requirements depend
on the scheduling scheme used. For example, parallel loops can be decomposed under the GSS algorithm to
achieve load balancing while keeping run-time overhead low. In a simplified scenario each physical
processor executes the following loop.

LOOP  FOREVER 

-  pick  front  queue  image; 

-  load  program  counter; 

-  execute; 

END  LOOP 

Part of task execution involves the execution of the drive-code in the task exit-block, which updates other
tasks and queues their corresponding images. A typical task exit-block, shown below,

EXIT-BLOCK:

   Task-dependent module:

      -Barrier synchronization;

      -Select processor to dequeue current task;

      -Select processor to execute the following;

   Task-independent module:

      FOR (all successors of current task) DO

         -Update (dependences of) successors;

         -Queue freed successors;

      ENDFOR

consists of two modules, the task-dependent and task-independent modules. The former is code that the
compiler generates to perform barrier synchronisation (in the case of parallel tasks), select the processor to
dequeue the corresponding task from the queue (e.g., the processor to dispatch the last iteration of a loop),
and select the processor to execute the second module of the exit-block (in a parallel task that processor
will be the one to clear the barrier). The latter module is independent of the type of the task. Only a
specific processor (selected by the first module) executes this part. The code in this module updates the
precedence relations and queues those successors that have no pending predecessors. Similarly, entry-blocks
are organised in two modules, a task-independent and a task-dependent module as shown below.

ENTRY-BLOCK:

   Task-independent module:

      -Allocate private variables and/or stack space;

      -Copy parent stack (optional);

   Task-dependent module:

      -Execute initialization code (if any);

      -Compute number of iterations for this processor;

      -Update loop indices;
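
The bookkeeping done by the task-independent module of an exit-block can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the drive-code generated by any compiler: the function `run_task_graph`, the dictionary representation of the task graph, and the sample graph are all hypothetical, and barrier synchronisation and processor selection are omitted.

```python
from collections import deque


def run_task_graph(successors, run):
    """Execute a task graph serially, mimicking the exit-block bookkeeping.

    successors maps every task name to the list of tasks that depend on it.
    """
    pending = {task: 0 for task in successors}   # unfinished predecessors per task
    for succs in successors.values():
        for s in succs:
            pending[s] += 1

    ready = deque(task for task in successors if pending[task] == 0)
    order = []
    while ready:
        task = ready.popleft()                   # pick front queue image
        run(task)                                # execute the task body
        order.append(task)
        # Task-independent module of the exit-block:
        for s in successors[task]:
            pending[s] -= 1                      # update dependences of the successor
            if pending[s] == 0:
                ready.append(s)                  # queue the freed successor
    return order


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = run_task_graph(graph, lambda task: None)
print(order)
```

A task enters the ready queue only when its count of pending predecessors reaches zero, so task D in the sample graph is queued by whichever of B and C finishes last.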

Loop (or parallel task) scheduling: Upon queueing, a serial task is dispatched as soon as a processor
becomes idle. However, a parallel task can draw several processors to execute it and thus it remains queued
until exhausted. The most frequent and important type of parallel task is the parallel loop. Section 7.3.1
discussed and compared existing and new approaches to dynamic scheduling of parallel loops. These loop
scheduling schemes can be viewed as dynamic partitioning of a loop into a set of allocatable units.

Within  a  processor:  We  finally  face  the  problem  of  scheduling  within  a  processor  at  the  fine  granularity 
level.  Packaging  of  parallelism  and  scheduling  at  this  level  is  more  architecture  dependent  than  any  of  the 
earlier phases of program scheduling [Nico84].

9.  Multiprocessor  Operating  Systems 

Traditionally, the terms "parallel processing" or "multiprocessing" refer to the parallel execution of
different components of a single application. If carried to the extreme, parallel processing is best realised in
a batch environment where application programs execute one after another, and at any given moment,
only one program executes on a parallel machine. At the other extreme, a parallel processor machine can
be used as a purely multiprogramming box to boost throughput rather than individual turnaround time.
Unfortunately, many of today's parallel computers are used in the latter mode of operation. This is
mainly due to our limited experience with and understanding of the new operating system issues brought
forward by parallel computers. Classical OS problems such as processor and memory allocation need to be
reconsidered for parallel machines [Rath88], [Poly89].

From the machine utilisation point of view the best mode of operation would be somewhere between
the two extremes: multiprocessing and multiprogramming at the same time. We call this polymorphous
processing or polyprocessing. Few systems support some primitive flavor of polyprocessing. Table 7 gives
an operating system taxonomy based on three fundamental properties. According to this taxonomy an
operating system is classified based on whether it is a Single-user (batch) or Multi-user system, whether it
allows Preemption or Non-preemption, and finally on whether it supports Serial or Parallel program
execution.

In Table 7 a three-letter abbreviation is used to denote each class. Clearly, the case of SPS is not
desirable. All other classes are viable cases including the single-user/preemptive/parallel or SPP. The
multi-user/preemptive/serial or MPS class characterises the majority of the operating systems of modern
parallel computers. Even though several machines are promoted as MPP systems, only in rare cases can
one operate effectively under this mode.
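The three-letter abbreviations follow mechanically from the three properties. The short sketch below enumerates them; the dictionary names are illustrative only.

```python
from itertools import product

# One letter per property, in the order used by the abbreviations.
USERS = {"S": "Single-user", "M": "Multi-user"}
PREEMPTION = {"P": "Preemptive", "N": "Non-preemptive"}
EXECUTION = {"S": "Serial", "P": "Parallel"}

classes = {
    u + p + e: (USERS[u], PREEMPTION[p], EXECUTION[e])
    for u, p, e in product(USERS, PREEMPTION, EXECUTION)
}

for code, properties in classes.items():
    print(code, "/".join(properties))
```

The loop prints eight classes, from SPS to MNP, which are the entries of Table 7.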

More research needs to be done in evaluating the merits and drawbacks of each of these classes.
Although MPP is intuitively the most desirable one, it may not necessarily be the most performance-oriented
one. Ideally, a multiprocessor OS should be adaptive, i.e., able to switch between several of the
modes in Table 7 automatically, based on the size and the characteristics of the workload. This appears to
be the most important feature of future multiprocessor operating systems.

In order to illustrate the significance (with respect to performance) of an adaptive OS, consider a large
parallel application written, e.g., in Cray or Cedar Fortran. Such a program would be heavily instrumented
with calls to the macro/microtasking library in order to obtain parallel execution. Recall that each call to
the run-time system incurs a significant overhead. If that program is executed during a high workload
period, it may (based on its priority, demands, etc.) end up executing serially (since a finite number of
physical processors need to be shared between a large number of jobs). Thus, all the overhead paid for
creating parallel processes, scheduling, and performing meaningless synchronisation is wasted. On the
other hand, the same program may execute overnight on an idle system, in which case it will execute in
parallel, justifying the overhead of, e.g., macro/microtasking. Of course, it is impossible and impractical for
the user to edit the code each time the program runs. This should be the responsibility of the OS: based on
its knowledge of the workload, it can selectively deactivate (some) parallel processing directives in certain
programs, which may end up executing serially anyway.
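The decision an adaptive OS would have to make can be illustrated with a toy cost model. Everything in this sketch is an assumption introduced for illustration (the function name `choose_mode`, the linear overhead term, and the numbers); the text above does not specify how the decision is taken.

```python
def choose_mode(idle_processors, n_iterations, overhead_per_processor, work_per_iteration):
    """Return 'parallel' only when the estimated parallel time beats the serial time.

    Hypothetical cost model: the work is divided evenly among the idle processors,
    and each processor used adds a fixed run-time-system overhead.
    """
    if idle_processors <= 1:
        return "serial"            # directives would only add overhead
    serial_time = n_iterations * work_per_iteration
    parallel_time = serial_time / idle_processors + overhead_per_processor * idle_processors
    return "parallel" if parallel_time < serial_time else "serial"

print(choose_mode(1, 1000, 5.0, 1.0))   # busy system: serial
print(choose_mode(8, 1000, 5.0, 1.0))   # idle system, large loop: parallel
print(choose_mode(8, 10, 5.0, 1.0))     # idle system, tiny loop: serial
```

The point of the sketch is only that the same program can warrant different answers at different times, which is why the choice is left to the OS rather than to the user.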

In order to build such sophistication into an OS, compiling and operating system issues need to be
considered simultaneously. For instance, in an auto-scheduling environment, the responsibility of the OS
is to direct idle processors to user program queues, in order to accomplish multiprogramming. It is then
the user program's sole responsibility to manage the allocated processor(s) on its own tasks. Some
processor allocation issues for adaptive OS are discussed in [Poly89].

Preemption        Single-user           Multi-user
                  Serial     Parallel   Serial     Parallel
Preemptive        SPS        SPP        MPS        MPP
Non-preemptive    SNS        SNP        MNS        MNP

Table 7. A multiprocessor operating system taxonomy.

10. Conclusions

The lack of methodologies and software to support parallel programming is profound even on the
most advanced parallel machines. Parallel programming is a complex task and the performance of a
parallel program can be influenced by many different factors such as coding of parallel constructs and/or
restructuring, scheduling schemes and scheduling overhead, synchronisation and/or communication cost,
program and data partitioning and memory allocation. Since there is no general quantitative or even
qualitative measure for these parameters, the process of optimising one or more of them is highly
empirical, and definitely application dependent. Presently the state of the art necessitates parallel programming
on a case-by-case (i.e. application) basis. This is probably the most serious barrier to the widespread use
of parallel machines and programming.

The restructuring compilers of the future will undoubtedly need to possess many properties that
presently one finds only in a handful of sophisticated (but experimental) parallelising compilers. So far,
attention has focused only on optimisation and restructuring techniques. However, the complexity of
the new parallel machines requires the compiler to perform many functions beyond restructuring.
Scheduling is a candidate for the compilers of the near future. Memory management, minimisation of
interprocessor communication, synchronisation and various other types of overhead are important issues
that could be tackled by the compiler. Another important aspect of the near future compilers for parallel
machines is ease of use and interaction with the user. There are many cases where the user's assistance (in
the form of assertions for example) is necessary for parallelising a program and exploiting the resulting
parallelism.

It is very likely that in the next few years we will see a transfer of many run-time activities (that are
now considered the operating system's responsibility) to the compiler. This will become necessary as
performance becomes more of a critical factor. Any activity involving the operating system is known to
involve a large overhead. This overhead cannot be tolerated above a certain point. Also, in time-sharing
systems, knowledge of specific program characteristics is not necessary to achieve high throughput. In
parallel processor environments however, knowledge of program characteristics is necessary for minimising
program turnaround time. Thus the shift of operating system functions to the compiler will be a logical
consequence. Compilers will become highly interactive and far more complex than modern restructurers,
while the software layer between the user and the hardware called the operating system will become
thinner, at least in high performance computer systems.

Parallelism in algorithms and programs may be implicit, or may be explicitly specified at several
different levels. When parallelism exists in fixed-size "quanta", it is rather easy to understand and
exploit. The unstructured nature of parallelism makes its efficient exploitation and programming
complex tasks. Devising methods and tools that automatically perform these tasks is thus a very
important research subject.


1 Presentation of EWS

EWS (EuroWorkStation) is an ESPRIT II project. Its objective is the creation of a High
Performance Technical Workstation.

The members of the project are Siemens (D), BULL (F), CHORUS (F), GIPSI s.a (F), Grupo
APD s.a (E), Rutherford Appleton Laboratories (GB), INRIA (F), INESC (P), FhG-AGD of
Darmstadt University (D) and Brunel University (GB).

For our part, we are building a Multi-SPARC workstation (with 4 processors) that can be considered as

•  a Basic Part with one SPARC CPU, which will run standard ABI applications,

•  a Computational Extension (with 3 other SPARC processors). Extra computational power
will be obtained by generating code that is run concurrently by the 4 processors, which
synchronize and share special data (contained in so-called tagged cells) via a dedicated bus which
bypasses memory accesses. To generate code dedicated to that architecture, we will customize
an existing SPARC compiler.

The Multi-SPARC architecture will allow EWS to benefit from evolutions in the SPARC family.
One aspect of our participation is to exploit parallelism on this architecture; this article focuses on
this aspect and more precisely on the parallel execution of Fortran programs.

2 Parallelization of Fortran Programs

2.1  Tools  for  Automatic  Parallelization 

Many Fortran dialects and many Fortran program restructuring tools have flourished in recent years
along with the wave of parallel processors.

Although  one  may  claim  that  making  sequential  Fortran  programs  run  efficiently  on  a  parallel 
architecture  is  easy,  a  simple  test  can  outline  the  difficulty  of  the  job  on  actual  architectures  [3]. 
On  the  other  hand,  designing  a  Parallel  Fortran  common  to  the  various  parallel  architectures  is 
still  a  “work  in  progress”  [5]. 

Our objective was not to design another parallel Fortran dialect but to design and implement tools
for transforming sequential Fortran applications into parallel Fortran applications and to allow them
to run on a Multi-SPARC workstation.

In the EWS Project, we intend to build a Parallelizer and a Parallel Fortran Compiler.

•  The Parallelizer analyses the DO loops of a Fortran program to detect parallelism
independently of the architecture of the machine.

The output of the Parallelizer will be a Fortran program where sequential and parallel
regions are distinguished and where synchronization barriers are minimized by using techniques
similar to the ones presented in [1]. BULL is in charge of building the interactive
parallelizer.

•  The Compiler uses the results of the Parallelizer to implement parallel execution on a
Multi-SPARC architecture. GIPSI is in charge of building the Parallel Fortran compiler.

This  work  is  done  in  cooperation  with  the  ESPRIT  project  GIPE  which  builds  an  interactive 
programming  environment  for  Fortran. 

2.2  Output  of  the  Parallelizer 

Currently, as a parallelizer, we use a prototype derived from VATIL, an automatic vectorizer
created by INRIA [4]. This vectorizer was written in LeLisp [2] and used for various targets
including the vector extension of DPX-1000; it uses a symbolic analysis of dependences which was
partly described in [4], allowing full use of information given by the programmer. With respect
to [4], the method for computing dependences has been changed, so that integer programming
problems are solved with the help of a simplex method; the resulting heuristic yields fairly
reasonable times for dependence computation.

The output of the Parallelizer is a Fortran source where parallelizable Fortran DO loops are
transformed into a list of parallel and sequential DO loops. There are two possible levels of output:

•  one called user-level output for interaction with the user, displaying the distribution of the
original loop into parallel and sequential loops, as well as the constraints imposed on execution
by dependences; we plan to modify the syntax of this output towards PCF Fortran [5].

•  another called compiler-level output to be read by the compiler; this compiler-level output
contains the same information as the user-level output but with mechanisms dedicated to
the execution on EWS.

For example, with the input:

      dimension x(100), y(100), z(100)
      dimension t(100), w(100)
      dimension u(100), s(100)

      do 1 i = 1,n
         x(i) = y(i) + z(i)
         u(i) = z(i+1) * w(i)
         y(i) = w(i)+y(i+1)
         u(i) = w(i)+u(i+1)
         z(i) = 3 * y(i)
         t(i) = t(i)+w(i)
         s(i) = s(i-1)+u(i)*y(i)
 1    continue
      end

The parallelizer produces the following user-level output (the compiler-level output will be shown
in the forthcoming paragraphs):

C$PARALLEL REGION

C$PARALLEL DO 1
C$DIRECTIVE LOCAL={i}
      do 1 i=1,n
         u(i) = w(i)*z(1+i)
         t(i) = t(i)+w(i)
 1    continue

C$PARALLEL DO 2 AFTER 1
C$DIRECTIVE LOCAL={i}
      do 2 i=1,n
         x(i) = y(i)+z(i)
 2    continue

C$SEQUENTIAL DO 3 AFTER 2
C$DIRECTIVE LOCAL={i}
      do 3 i=1,n
         y(i) = w(i)+y(1+i)
         u(i) = u(1+i)+w(i)
         s(i) = s(i-1)+u(i)*y(i)
 3    continue

C$PARALLEL DO 4 AFTER 3
C$DIRECTIVE LOCAL={i}
      do 4 i=1,n
         z(i) = 3*y(i)
 4    continue

C$END PARALLEL REGION

In  a  PARALLEL  DO  loop,  all  the  iterations  of  the  loop  are  independent  and  can  be  executed  in 
any  order  without  any  synchronization. 

In  a  SEQUENTIAL  DO  loop,  there  are  dependences  between  the  iterations  and  the  iterations  must 
be  done  sequentially. 
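The difference between the two kinds of loop can be checked on two statements taken from the example. The sketch below is an illustration in Python rather than the authors' tool: it runs `x(i) = y(i)+z(i)` (from the PARALLEL DO 2) and `y(i) = w(i)+y(i+1)` (from the SEQUENTIAL DO 3) in several iteration orders. The array contents, the size `n = 8` and the function names are assumptions.

```python
import random

n = 8
y0 = [float(i) for i in range(n + 2)]   # index 0 unused; y0[n+1] is read at i = n
z = [2.0 * i for i in range(n + 2)]
w = [1.0] * (n + 2)

def parallel_loop(order):
    """x(i) = y(i) + z(i): no iteration reads a value written by another one."""
    x = [0.0] * (n + 2)
    for i in order:
        x[i] = y0[i] + z[i]
    return x

def sequential_loop(order):
    """y(i) = w(i) + y(i+1): iteration i reads y(i+1), which iteration i+1 overwrites."""
    y = list(y0)
    for i in order:
        y[i] = w[i] + y[i + 1]
    return y

forward = list(range(1, n + 1))
backward = forward[::-1]
shuffled = forward[:]
random.Random(0).shuffle(shuffled)

print(parallel_loop(forward) == parallel_loop(shuffled))     # True
print(sequential_loop(forward) == sequential_loop(backward)) # False
```

Any permutation of the iterations gives the same `x`, whereas changing the order of the second loop changes `y`, which is why the Parallelizer marks it SEQUENTIAL.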

2.3 Extensions to Fortran 77

The compiler-level output of the preprocessor is expressed in an extended Fortran 77 where the main
extensions are:

•  local variables: these are variables local to a section of code; instantiations of that
section of code running concurrently each have their own instantiation of these variables,

•  reduction variables: these variables must be accessed with mutual exclusion by sections
running concurrently,

•  multithreading operations: these operations allow several processors to cooperate in the
execution of a parallel region; they are done via subroutine calls which will be inlined by the
compiler into SPARC instructions.

2.4 Impact on the Fortran Compiler

We decided to adapt the Sun 4 Fortran Compiler and to minimize our interventions in the
compiler by using dedicated pre- and postprocessors.

We made that choice because many SPARC compilers are in development and we hope that this
strategy will allow us to port our work to any of these compilers at a low cost.

The two main interventions in the compiler are:

•  protect the code associated with parallel regions from optimizations that are meaningful for a
mono-SPARC but erroneous for a Multi-SPARC,

•  generate specific code which controls access to tagged cells.

2.5  Parallel  Execution  of  Fortran  Programs 

2.5.1  Mechanisms  for  Communication  and  Synchronization 

•  A task is an execution environment and the basic unit of resource allocation. A task includes
a virtual address space for code and data. A Fortran program is a task.

•  A thread is the basic unit of execution. It consists of all processor state necessary for
independent execution (e.g. hardware registers). A thread executes in the virtual address
space of a task. A SEQUENTIAL DO loop will be executed by one thread and a PARALLEL
DO loop will be split into several threads.

•  A cell is the basic mechanism for communication and synchronization between threads. A
cell contains a datum plus a tag which indicates whether the datum is available(1) or not. The tag
of a cell has the same value, at any time, for all the processors. The cells are implemented in
a special memory space called the synchronization space.

There are three atomic operations on a cell:

•  LOCK

syntax :  d = LOCK(cell)
semantics :

      wait until tag(cell) == available
      d = value(cell)
      tag(cell) = unavailable

•  UNLOCK

syntax :  call UNLOCK(cell,d)
semantics :

      value(cell) = d
      tag(cell) = available

•  GET

syntax :  d = GET(cell)
semantics :

      wait until tag(cell) == available
      d = value(cell)

(1) "available" means "readable at that moment".

Remarks:

1.  Each atomic operation is inlined into one RISC instruction.

2.  The operations LOCK ... UNLOCK permit an access with mutual exclusion to data and
sections of code.

3.  An UNLOCK not preceded by a LOCK means an initialization of a cell (this usually
occurs in a monoprocessor section just before entering a multiprocessor section).

4.  An access via GET to a cell means an access to a volatile value of the cell.
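The semantics of the three operations can be emulated in software. The sketch below is only a model of the stated semantics, written in Python with a condition variable; on EWS each operation is a single instruction on the dedicated bus. The class name `Cell` and the counter example are assumptions.

```python
import threading

class Cell:
    """Software model of a tagged cell: a value plus an 'available' tag."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._available = False

    def lock(self):
        """LOCK: wait until available, read the value, mark the cell unavailable."""
        with self._cond:
            self._cond.wait_for(lambda: self._available)
            self._available = False
            return self._value

    def unlock(self, d):
        """UNLOCK: store d and mark the cell available."""
        with self._cond:
            self._value = d
            self._available = True
            self._cond.notify_all()

    def get(self):
        """GET: wait until available and read the value, leaving the tag unchanged."""
        with self._cond:
            self._cond.wait_for(lambda: self._available)
            return self._value

counter = Cell()
counter.unlock(0)                  # UNLOCK without a preceding LOCK initializes the cell

def worker(times):
    for _ in range(times):
        v = counter.lock()         # exclusive access between LOCK and UNLOCK
        counter.unlock(v + 1)

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.get())  # 800
```

Because a second `lock` blocks until the matching `unlock`, the LOCK ... UNLOCK pair behaves as the mutual exclusion described in remark 2, and no increment is lost.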

2.5.2  Execution  of  a  Parallel  Region 

The output of the Parallelizer is organized in regions; some of them are parallel regions which can
be concurrently executed by several processors; the other regions will be executed by only one
processor which is called the main processor.

While the main processor is running a non-parallel region, the other processors are idle, waiting for
the address of the next parallel region to be run concurrently. That address will be posted by the
main processor in a special cell called multi.

The  run-time  context  of  a  processor  is  divided  in  two  parts  : 

•  a  general  context  which  is  shared  by  all  the  processors  and  composed  of  the  executable  code 
and  the  data  in  memory  and  cells, 

•  a  private  context  which  is  specific  to  each  processor  and  composed  of  its  registers  and  its 
stack. 

The distribution of the execution over the several processors brings about data transfers between the
private contexts of processors. One of the tasks of code generation is to minimize these transfers
and to correctly initialize the private context of each processor.

A parallel region is a sequence of PARALLEL and SEQUENTIAL DO loops with which threads
are associated; threads constitute the minimal execution unit that processors concurrently try to execute.
One thread is associated with a SEQUENTIAL DO loop; several threads are associated with a
PARALLEL DO loop.

A cell is associated with each loop of a parallel region; its value indicates whether the loop
is ready to be executed or not and, in the former case, which threads are to be executed if it is a
PARALLEL loop.

All the processors concurrently run the code of a parallel region and test the cells to determine
the threads to be executed. The compiler uses autoscheduling techniques inspired by [6] and
generates code that provides autoscheduling of the processors without any system call.

The outlines of the main pieces of code are:

code run by the main processor:

      call INITMULTI(nb_cell,nb_proc)
C     Initialization of the cells associated to the loops of the region
C     with the UNLOCK operation
      ...
C     End of initialization
      call MULTIREGION(region,..)

where

      subroutine INITMULTI(nb_cell,nb_proc)
C     nb_proc is the number of processors offered by the machine
 99   if (GET(nextmulti) .ne. 0) goto 99
      call UNLOCK(nextmulti,nb_proc-1)
C     nextmulti : when the value of this cell is 0, it means that all the
C                 auxiliary processors are ready for the execution of the
C                 next multiregion.
      call UNLOCK(endmulti,nb_proc)
C     endmulti : when the value of this cell is 1, it means that all the
C                auxiliary processors have finished the execution of the
C                current multiregion.
      end INITMULTI

      subroutine MULTIREGION(region,..)
C     storage of the parameters in exchange zone (tagged cells)
      ...
C     End of the storage
      call UNLOCK(multi,@region)
      call @multi
 99   if (GET(endmulti) .ne. 1) goto 99
      LOCK(multi)
      n = LOCK(endmulti)
      call UNLOCK(endmulti,n-1)
      end MULTIREGION


code concurrently run by all the processors:

      subroutine region(...)
C     stop is a cell, the value of which becomes equal to 1
C     when all the threads have been selected
 9999 continue
      ...  - PARALLEL and SEQUENTIAL loops
      if (GET(stop) .le. 0) goto 9999
      end region

code run by the auxiliary processors:

This code is written in SPARC assembly language; a Fortran-like version is:

 999  addr = GET(multi)
      ...  - Initialization of the private context
           - from dedicated cells
      call @addr
      n = LOCK(endmulti)
      call UNLOCK(endmulti,n-1)
 99   if (GET(endmulti) .ne. 0) goto 99
      n = LOCK(nextmulti)
      call UNLOCK(nextmulti,n-1)
      goto 999


2.5.3  Algorithms  to  generate  synchronization  between  the  loops  of  a  region 

One must keep in mind that:

•  the code between C$PARALLEL REGION and C$END PARALLEL REGION will be run concurrently
by the different processors allocated to the Fortran program,

•  one thread is associated with a SEQUENTIAL DO loop,

•  a PARALLEL DO loop is split into several threads; this number of threads is not bounded
by the number of processors allocated to the Fortran program; the threads of a PARALLEL
DO loop are independent of each other,

•  the whole synchronization is realized via access to cells and without any system call.

Region with only SEQUENTIAL DO loops

Let us consider the previous example where the PARALLEL DO loops are considered as
SEQUENTIAL DO loops:

C$PARALLEL REGION

C$SEQUENTIAL DO 1

C$SEQUENTIAL DO 2 AFTER DO 1

C$SEQUENTIAL DO 3 AFTER DO 2

C$SEQUENTIAL DO 4 AFTER DO 3

C$END PARALLEL REGION

The compiler transforms the region using the following algorithm:

•  A cell is associated with each DO loop(3).

•  If a loop is independent, its cell is initialized to 1.

•  If the loop DO n depends on p other DO loops, the cell associated with DO n is initialized to
1-p.

•  There is an extra cell called stop which is initialized to a value equal to minus the number
of leaf-loops plus one. This cell is incremented every time a leaf-loop completes.

•  When a loop completes, the cells of the loops depending on that loop are incremented by 1
via a multithreading operation SIGCHORE(cell).

•  When the value of a cell becomes equal to 1, the associated loop becomes eligible.

•  When a loop is selected, its cell is reset to 0.

•  The test of the eligibility and the loop selection is done via a multithreading operation
ISREADY(cell).

All the processors allocated to the task (i.e. the Fortran program) try concurrently to execute the
threads (i.e. the eligible DO loops). When stop becomes equal to 1, all the processors but one
become idle, waiting for the next region to be concurrently executed.

For example, the loop DO 3 will be changed into

C$SEQUENTIAL DO 32
      if (ISREADY(3) .gt. 0) then
         do 32 i=1,n
            ...
 32      continue
         call SIGCHORE(4)
      endif

The multithreading operations are written in SPARC assembly language and inlined. Their
meaning - in Fortran - is:

      function ISREADY(cell)
      integer isready
      isready = LOCK(cell)
      if (isready .gt. 0) then
         call UNLOCK(cell,isready-1)
      else
         call UNLOCK(cell,isready)
      endif
      return
      end

      subroutine SIGCHORE(cell)
      integer status
      status = LOCK(cell)
      call UNLOCK(cell,status+1)
      end

(3) In all the examples, the cell p is associated with the "do p" loop.
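The cell-counting algorithm for a region with only SEQUENTIAL DO loops can be traced with a small single-threaded simulation. This is a sketch in Python under stated assumptions: the cells are plain integers, so the LOCK/UNLOCK atomicity is not modelled, a loop is treated as completing as soon as it is selected, and the loops are scanned in reverse order on purpose to show that eligibility, not textual order, decides what runs.

```python
def isready(cells, c):
    """ISREADY: return the cell value; if positive, select the loop by decrementing it."""
    status = cells[c]
    if status > 0:
        cells[c] = status - 1
    return status

def sigchore(cells, c):
    """SIGCHORE: signal one completed predecessor by incrementing the cell."""
    cells[c] += 1

# deps[n] lists the loops that DO n must wait for (the chain of the example).
deps = {1: [], 2: [1], 3: [2], 4: [3]}
dependents = {n: [m for m in deps if n in deps[m]] for n in deps}
leaves = [n for n in deps if not dependents[n]]

cells = {n: 1 - len(deps[n]) for n in deps}   # 1 if independent, 1-p otherwise
cells["stop"] = 1 - len(leaves)               # minus the number of leaf-loops plus one

executed, passes = [], 0
while cells["stop"] <= 0:                     # mirrors: if (GET(stop) .le. 0) goto 9999
    passes += 1
    for n in sorted(deps, reverse=True):      # adversarial scan order: 4, 3, 2, 1
        if isready(cells, n) > 0:
            executed.append(n)                # stands in for running the loop body
            for m in dependents[n]:
                sigchore(cells, m)
            if n in leaves:
                sigchore(cells, "stop")

print(executed, passes, cells["stop"])  # [1, 2, 3, 4] 4 1
```

Even with the reversed scan, the loops run in dependence order, and the region ends when the stop cell reaches 1.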


Region  with  only  a  PARALLEL  DO  loop 

A PARALLEL DO loop is split into several threads to be concurrently executed by processors
allocated to the task. There are different strategies to fix the number of threads and the scheduling
of the iterations between the threads.

If the loop is a genuine parallel loop, there are two strategies to fix the number of threads:

•  If one knows that the computing time is approximately the same for each iteration of the
loop (e.g. a loop with no conditional instruction), then the number of threads p will be set to
the number of processors allocated to the task and each thread will execute n/p iterations.

•  If one knows that the computing time varies significantly between the n iterations, a better
balance of the work between the processors executing the threads will be obtained if there
are more threads, each executing fewer iterations.

In both cases, each thread executes a chunk of consecutive iterations (the size of a chunk depends
on the number of iterations and on the number of threads executing the loop concurrently).

The genuine parallel loop becomes:

C$PARALLEL DO 1
      status = ISREADY(1)
      if (status .gt. 0) then
         myfirst = 1+chunk*(status-1)
         mylast = min(myfirst+chunk-1,n)
         do 1 i=myfirst,mylast
            ...
 1       continue
      endif
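The bounds computed from `status` can be checked numerically. The sketch below reproduces the two formulas in Python; the values `n = 10` and 4 threads are assumptions, and the chunk size is taken as the ceiling of n divided by the number of threads, which is the value produced by the integer expression `(n-1)/4 + 1` used later in the paper.

```python
def chunk_bounds(status, chunk, n):
    """Bounds of the chunk selected by a thread whose ISREADY call returned `status`."""
    myfirst = 1 + chunk * (status - 1)
    mylast = min(myfirst + chunk - 1, n)
    return myfirst, mylast

n, n_threads = 10, 4
chunk = (n - 1) // n_threads + 1          # ceiling of n / n_threads

# ISREADY returns n_threads, n_threads-1, ..., 1 to successive callers.
bounds = [chunk_bounds(s, chunk, n) for s in range(n_threads, 0, -1)]
print(bounds)  # [(10, 10), (7, 9), (4, 6), (1, 3)]

covered = sorted(i for first, last in bounds for i in range(first, last + 1))
print(covered == list(range(1, n + 1)))  # True
```

Each iteration falls in exactly one chunk; when n is not a multiple of the number of threads, the thread that received the highest status gets a shorter (possibly empty) chunk.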


Remark:

In some cases, the number of threads is imposed by dependences between iterations. For example,
if we change the instruction of the loop DO 2 into:

      x(i) = y(i)+z(i)+x(i-3)

then this loop is parallelizable if and only if it is executed by 3 threads, each executing one iteration out of every 3.

So the Parallelizer will indicate:

C$PARALLEL DO 2 AFTER DO 1 - THREAD=3

This number of threads imposed by dependences between iterations is not limited by the number
of available processors.

We will call this kind of loop a pseudo parallel loop because the iterations are not independent and
the parallelism is obtained via an artefact.

A pseudo parallel loop must be split into p threads, and each thread executes one iteration out of every p
iterations.

The pseudo parallel loop becomes:

C$PARALLEL DO 2 -THREAD=p
      status = ISREADY(2)
      if (status .gt. 0) then
         do 2 i=status,n,p
            ...
 2       continue
      endif
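That the strided split preserves the result of `x(i) = y(i)+z(i)+x(i-3)` can be checked with a small experiment. The sketch below is an illustration in Python, not generated code: the three "threads" are interleaved at random on a single processor, each keeping its own iterations in order, and the result is compared with the sequential loop. The data, the size `n = 12` and the boundary values of x are assumptions.

```python
import random

n, p = 12, 3
y = [0.5 * i for i in range(n + 1)]           # y(1..n); index 0 unused
z = [float(i) for i in range(n + 1)]          # z(1..n); index 0 unused

def new_x():
    # x(i) for i = 1-p .. n is stored at list index i + p - 1; boundary values are 1.0
    return [1.0] * (n + p)

def body(x, i):
    # x(i) = y(i) + z(i) + x(i-p), where x(i-p) is stored at index i - 1
    x[i + p - 1] = y[i] + z[i] + x[i - 1]

x_seq = new_x()
for i in range(1, n + 1):                     # reference: plain sequential loop
    body(x_seq, i)

# Thread s (1..p) owns iterations s, s+p, s+2p, ... as in "do 2 i=status,n,p".
threads = [list(range(s, n + 1, p)) for s in range(1, p + 1)]
rng = random.Random(1)
x_par = new_x()
while any(threads):
    t = rng.choice([t for t in threads if t]) # pick any thread that still has work
    body(x_par, t.pop(0))                     # its own iterations stay in order

print(x_seq == x_par)  # True
```

Iteration i only reads the value written by iteration i-3, which belongs to the same thread, so the three chains never interfere whatever the interleaving.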

Region  with  SEQUENTIAL  and  PARALLEL  DO  loops 

The algorithm for transforming such a region is a generalization of the former algorithm used for
regions with only SEQUENTIAL DO loops:

•  A cell is associated with each DO loop.

•  If a loop is independent, its cell is initialized to the number of its associated threads.

•  If the loop DO n depends on p other threads, the cell associated with DO n is initialized to
1-p.

•  There is an extra cell called stop which is initialized to a value equal to minus the number
of leaf-threads plus one. That cell is incremented every time a leaf-thread completes.

•  When a thread completes, the cells of the loops depending on that thread are incremented
by 1.

•  When a loop becomes eligible, if that loop is a PARALLEL loop then its cell is set to the
number of threads into which the loop is split and if it is a SEQUENTIAL loop it is set to 1.

•  Each time a loop is selected, its cell is decremented by 1 and a thread is executed.

This implies the introduction of another multithreading operation: to this effect we add an
optional parameter to SIGCHORE(cell) with default value 1; this represents the number of threads
associated with the loop when it becomes eligible. Its semantics are:

      subroutine SIGCHORE(cell,value)
      integer status
      status = LOCK(cell)
      if (status .lt. 0) then
         call UNLOCK(cell,status+1)
      else
         call UNLOCK(cell,value)
      endif
      end

Note that when value = 1, SIGCHORE(cell,1) has the same meaning as the previously given
semantics for SIGCHORE(cell), because status can never become strictly positive.

A Comprehensive Example

Let the result of the Parallelizer be:

C$PARALLEL REGION
C$PARALLEL DO 1

C$PARALLEL DO 2 AFTER DO 1 - THREAD=3

C$SEQUENTIAL DO 31 AFTER DO 2

C$SEQUENTIAL DO 32 AFTER DO 1

C$SEQUENTIAL DO 33 AFTER DO 31, DO 32

C$PARALLEL DO 4 AFTER DO 31

C$END PARALLEL REGION

Each SEQUENTIAL DO loop will have one thread.
The loop DO 2 will be split into 3 threads.
The loops DO 1 and DO 4 will be split into 4 threads.

The whole region will become:

C$PARALLEL REGION
      call UNLOCK(1,4)
      call UNLOCK(2,-3)
      call UNLOCK(31,-2)
      call UNLOCK(32,-3)
      call UNLOCK(33,-1)
      call UNLOCK(4,0)
      call UNLOCK(stop,-4)
      call INITMULTI(region,...)
C$END PARALLEL REGION

      subroutine region(...)
C$LOCAL i, status, myfirst, mylast, chunk
      chunk=(n-1)/4 + 1
 9999 continue
C$PARALLEL DO 1
      status=ISREADY(1)
      if (status .gt. 0) then
         myfirst=1+chunk*(status-1)
         mylast=min(n,myfirst+chunk-1)
         do 1 i=myfirst,mylast

    1    continue
         call SIGCHORE(2,3)
         call SIGCHORE(32)
      end if

C$PARALLEL DO 2 AFTER DO 1 - THREADS = 3
      status=ISREADY(2)
      if (status .gt. 0) then
         do 2 i=status,n,3

    2    continue
         call SIGCHORE(31)
      end if

C$SEQUENTIAL DO 31 AFTER DO 2
      if (ISREADY(31) .gt. 0) then
         do 31 i=1,n

   31    continue
         call SIGCHORE(4,4)
         call SIGCHORE(33)
      end if

C$SEQUENTIAL DO 32 AFTER DO 1
      if (ISREADY(32) .gt. 0) then
         do 32 i=1,n

   32    continue
         call SIGCHORE(33)
      end if

C$SEQUENTIAL DO 33 AFTER DO 31, DO 32
      if (ISREADY(33) .gt. 0) then
         do 33 i=1,n

   33    continue
         call SIGCHORE(stop)
      end if

C$PARALLEL DO 4 AFTER DO 31
      status=ISREADY(4)
      if (status .gt. 0) then
         myfirst=1+chunk*(status-1)
         mylast=min(n,myfirst+chunk-1)
         do 4 i=myfirst,mylast

    4    continue
         call SIGCHORE(stop)
      end if

      if (GET(stop) .le. 0) goto 9999
      end
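
The initial UNLOCK values and the index ranges used in the example above can be cross-checked with a short sketch. The rule coded here (a loop without predecessors starts with its thread count, every other loop starts at one minus the number of signals it will receive) is inferred from the example and is not stated as a general rule in the text; the function names are hypothetical.

```python
def initial_counters(threads, after):
    """threads: loop -> thread count; after: loop -> list of predecessor loops."""
    counters = {}
    for loop, nthreads in threads.items():
        preds = after.get(loop, [])
        if not preds:
            counters[loop] = nthreads  # eligible at once
        else:
            # every thread of every predecessor signals once
            counters[loop] = 1 - sum(threads[p] for p in preds)
    with_successor = {p for preds in after.values() for p in preds}
    sinks = [loop for loop in threads if loop not in with_successor]
    counters["stop"] = 1 - sum(threads[loop] for loop in sinks)
    return counters


def block_range(n, nthreads, t):
    """Indices of thread t (1-based) under the block split used for DO 1 and DO 4."""
    chunk = (n - 1) // nthreads + 1
    first = 1 + chunk * (t - 1)
    last = min(n, first + chunk - 1)
    return list(range(first, last + 1))


def cyclic_range(n, nthreads, t):
    """Indices of thread t (1-based) under the cyclic split used for DO 2."""
    return list(range(t, n + 1, nthreads))


threads = {1: 4, 2: 3, 31: 1, 32: 1, 33: 1, 4: 4}
after = {2: [1], 31: [2], 32: [1], 33: [31, 32], 4: [31]}
counters = initial_counters(threads, after)
```

With these inputs the sketch reproduces the seven UNLOCK arguments of the parallel region, and both splits cover the index range 1..n exactly once.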


3  Conclusion 

We have completed a prototype version of the parallel compiler where the multithreading is
implemented via lightweight processes on a SUN 4. This version allows us to check the results
of the parallelizer and of the compiler.

During the second quarter of 1990, a MultiSparc EWS prototype will be available to evaluate our
multithreading mechanisms on actual hardware.

We also plan to extend our multithreading mechanisms to the parallelism defined in PCF Fortran
and to design a parallelizer that takes as input Fortran 77 programs and produces PCF Fortran
programs.
Abstract

In this paper some of the key issues are described for the
implementation of efficient sparse computational codes on supercomputers.
Whereas dense computations run mostly near the peak performance of
new architectures, for sparse computations this is certainly not true.
Sparse matrix computations are characterized by the relatively small
number of operations per data element and the irregularity of the
computation. Both facts may significantly increase the overhead time due
to memory traffic. Further, the development of sparse code is far from
trivial. The use of sophisticated data structures together with complex
control flow makes designing sparse codes an almost unmanageable
task.


1  Introduction 

A major difficulty associated with sparse matrix computations is that the
relation between hardware and algorithm is complex because of the random
nature of the computations. Sparse matrix computations are characterized by
several features: (i) complex data handling, (ii) irregular data streams,
(iii) indirect addressing, and (iv) a low ratio of arithmetic operations to
data element references. The first feature is caused by the fact that sparse
matrices are stored in a condensed format in order to minimize the storage
requirements, which leads to substantial overhead. The second leads to lower
effective memory bandwidth, the third complicates vectorization and
parallelization of the sparse matrix codes, and the fourth can lead to
excessive data movement and loss of data locality.

* This work was supported in part by the National Science Foundation under Grant
No. US NSF CCR-8717943, and the US Department of Energy under Grant No. US DOE
DE-FG02-85ER25001.

As a result, the performance of sparse matrix computations is dictated by
the ability of a specific architecture to support high and irregular data traffic
between computational elements and memory. In fact, the common assumption
that long-running numerical application software is CPU-limited [11]
is questionable. Super/parallel computers today are very subtly tuned for
matching the computational cycle with the memory access cycle. This results
in very good performance of regular computations in which not too many
data streams from and to memory are involved. It seems likely that this
trend will continue, with future systems becoming even more dependent on
regular data transfers. It appears that computational kernels in which there
are at most three regular data streams involved can be exploited fairly well
by most of the currently marketed supercomputers. However, in some cases,
one can detect a significant decrease in performance when there are three or
more data streams instead of just two data streams per operation [14].

Not only is the performance of architectures stressed by sparse matrix
computations; they also constitute a major problem for restructuring
compilers and program environments. As a major part of the loop structures in
sparse codes employ indirect addressing (subscripted subscripts), data
dependencies usually cannot be determined at compile time. The only known
restructuring techniques are based on a runtime evaluation of the
indirection arrays [13,16]. The evaluation of these indirection arrays is too
costly, however, if the number of operations performed on the array
elements is small. For a detailed account of the performance tradeoff of these
techniques see [13]. For some instances of sparse codes the overhead due to
runtime evaluation of the indirection arrays can be nullified. This is in
particular true for iterative methods for solving linear systems of equations. As
these methods involve a series of matrix-vector multiplies (and possibly
triangular solves) where the sparsity structure of the matrix does not change
from one iteration to the other, the evaluation has to be performed only
once.
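
The idea of evaluating the indirection arrays once and reusing the result can be sketched for a sparse triangular solve. The sketch below is an illustration chosen here, not the method of [13,16]: it assumes a lower triangular matrix in compressed row storage with a stored nonzero diagonal, and the names `inspect_levels` and `solve_with_schedule` are hypothetical. The first function inspects the column indices and groups the rows into levels of mutually independent rows; the second reuses that schedule for every right-hand side.

```python
def inspect_levels(rowptr, colind):
    """Runtime inspection: group the rows of a lower triangular matrix into levels."""
    n = len(rowptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            j = colind[k]
            if j < i:  # row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    schedule = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        schedule[lv].append(i)
    return schedule


def solve_with_schedule(rowptr, colind, values, b, schedule):
    """Forward substitution; rows inside one level could run in parallel."""
    x = [0.0] * len(b)
    for rows in schedule:
        for i in rows:
            s, diag = b[i], None
            for k in range(rowptr[i], rowptr[i + 1]):
                j = colind[k]
                if j == i:
                    diag = values[k]
                else:
                    s -= values[k] * x[j]
            x[i] = s / diag
    return x


# L = [[2,0,0,0],[0,1,0,0],[1,0,4,0],[0,3,1,2]] in compressed row storage
rowptr = [0, 1, 2, 4, 7]
colind = [0, 1, 0, 2, 1, 2, 3]
values = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]
schedule = inspect_levels(rowptr, colind)  # done once
x1 = solve_with_schedule(rowptr, colind, values, [2.0, 1.0, 5.0, 7.0], schedule)
x2 = solve_with_schedule(rowptr, colind, values, [4.0, 2.0, 10.0, 14.0], schedule)
```

The cost of the inspection is paid once and amortized over all subsequent solves with the same sparsity structure.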

The development of scientific libraries for efficient sparse matrix
computation codes is a major effort. This is not only due to the fact that the
various architectures require very different techniques in order to be
utilized efficiently, but also because most existing sparse matrix computation
codes cannot be viewed as consisting of a number of higher level computational
primitives. After all, the components of a sparse matrix computation code
interact strongly with each other. For instance, in a direct solver for systems
of linear equations, the handling of fill-in determines the storage format of
the matrix, which directly influences the reordering used in the code, and
the choice of a pivoting strategy influences both the fill-in handling and the
reordering. This directly implies that, if higher level primitives can be
identified, these primitives need to be implemented in various ways in order to
be utilized in different code instances.

Another major problem formed by sparse computations is the complexity
of code development. This is mainly caused by the fact that the use of
complicated condensed formats for storing the sparse matrices forces the
programmer to implement explicit garbage collection routines to free space,
and routines to reformat data structures. Secondly, the exploitation of
parallelization and vectorization often involves sophisticated preprocessing code.
To illustrate this, in the following table we counted for two sparse Fortran
codes the total number of lines of code and the number of lines in the code
in which floating point operations are performed.


            # of actual lines    # lines with fl. point operations
Ma28        (illegible)          28
McSparse    (illegible)          30

Ma28 is a commonly used package for solving general sparse systems of linear
equations developed at Harwell [2]. McSparse is a package for solving these
systems which exploits different levels of parallelism and hierarchical
memory systems [7,19]. As can be seen from this table, the number of lines which
contain floating point instructions is negligible. This does not directly imply
that the amount of time spent performing these floating point instructions
is negligible. However, it shows that most of the effort of developing these
codes is spent on the non-computational part of the code.

The difficulties described above are less of an issue for structured sparse
computations. Structured sparse matrices arise mainly from the numerical
handling of partial differential equations by either finite element or finite
difference techniques. These matrices differ from general sparse matrices in
the sense that mostly a diagonal storage scheme can be used, which can be
exploited to obtain longer vector operations. Also, irregular data streams do
not occur as frequently as for general sparse matrices, which alleviates the
constraints on the effective memory bandwidth. Further, complex data
manipulations are not needed. Because of this, the remainder of the paper deals
mainly with unstructured sparse computations. In the next two sections
some specific solutions for the implementation of direct methods and
iterative methods are described for solving general sparse systems of equations
on parallel/super computers.

2  Direct  Solvers  for  Sparse  Linear  Systems 

Implementations of direct solvers for general sparse matrices on vector and
parallel architectures mostly perform very poorly. This is mainly caused by
the fact that the exploitation of parallelism and vectorization in such codes
is strongly dependent on the sparsity structure of the matrix A. In the case
that the sparsity pattern of the matrix A is arbitrary, it is almost impossible
to exploit any parallelism/vectorization a priori. There are essentially three
different techniques for generating parallelism in direct solvers for general
linear systems of equations. The first technique exploits parallelism at the
level of the rank-1 update. Secondly, multiple rank-1 updates can be
performed in parallel. This mostly requires a search for a set of diagonal or
triangular pivots. Thirdly, a global ordering (tearing technique) can be used
to decompose the sparse matrix into blocks which can be factored in parallel.

Another technique, which has recently been used to obtain efficient
implementations of these codes, is based on identifying sub-systems which are
sufficiently small and have a reasonable amount of fill-in. These sub-systems
can be treated as dense matrices, and dense factorizations can be used, which
are mostly much more efficient than sparse factorizations. A good example
of this development is given by the multifrontal code [4]. In this code frontal
matrices are assembled by examining the elimination tree. However, the
efficiency of the multifrontal approach is degraded if the pattern of the sparse
matrix is far from symmetric.

Although tearing techniques are mostly considered for introducing large
grain parallelism in a sparse solver, they also identify subsystems in which
the amount of fill-in is reasonably large. Because tearing techniques bring
a sparse system into bordered upper triangular block form, fill-in will be
concentrated in the diagonal blocks and the border. This is in particular true
for the “coupling block”, which in most cases will become nearly dense during the
factorization. In the remainder of this section a direct solver, McSparse, is
discussed which is currently being developed at CSRD. The underlying
idea for developing this solver was to obtain a sparse code which would
exploit different levels of parallelism. As new architectures are getting more
and more hierarchically structured, it will become very important in the near
future to devise codes which allow more than one form of parallelism to be
utilized. The Cedar architecture [10] was mainly used to develop a first
version of McSparse. Currently the code is being ported onto a Cray 2 and
a Cray YMP.

Figure 1: The ordering H*

McSparse is based on finding large grain parallelism such that the
factorization of a general sparse matrix A can be divided up into partitions
which allow the clusters or processors to work in parallel on the problem.
(On Cedar, finer granularity parallelism can be exploited within a cluster.)
Large grain parallelism is obtained by ordering the matrix by a hybrid
ordering H* which is related to tearing techniques for nonsymmetric matrices [5].
Tearing techniques, however, are based mostly on nonsymmetric orderings
(column and row interchanges). The H* algorithm differs in that it uses
an initial nonsymmetric ordering (H0) and subsequent symmetric orderings
to achieve the desired form. The H* ordering produces a bordered block
upper-triangular form. The ordering is a hybrid form combining Tarjan's
algorithm [17], the H1 ordering which is derived from Tarjan's algorithm, and
the H2 ordering which is based on a modified form of Nested Dissection [8]
that exploits the nonsymmetric structure of the matrix. Additionally, H*
tries to enhance the stability of the pivoting strategies used in the
factorization algorithm. This is accomplished via three techniques. First, the H0
phase generates a transversal such that the element on the diagonal for each
column is within a bound of the largest element within the column. In the
H1 and H2 phases, two techniques are used which monitor the size of the
elements placed in the border. In figure 1 a sparse matrix (bp_800 from the
Harwell/Boeing collection [3]) is depicted before and after being reordered
by H*.
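
The role of Tarjan's algorithm in such an ordering can be illustrated by a minimal sketch. It covers only the classical step of finding the strongly connected components of the matrix graph, which yields a block upper-triangular form under a symmetric permutation; the H1 and H2 orderings and the border construction of H* are not reproduced. The sketch assumes a zero-free diagonal (as obtained from a transversal), uses recursion and is therefore meant for small examples only; the function name is hypothetical.

```python
def block_upper_triangular_order(adj):
    """adj[i] lists the columns j with a nonzero a_ij; returns the diagonal blocks."""
    n = len(adj)
    index = [None] * n
    low = [0] * n
    on_stack = [False] * n
    stack, blocks = [], []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack[v] = True
        for w in adj[v]:
            if index[w] is None:
                visit(w)
                low[v] = min(low[v], low[w])
            elif on_stack[w]:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a strong component
            block = []
            while True:
                w = stack.pop()
                on_stack[w] = False
                block.append(w)
                if w == v:
                    break
            blocks.append(sorted(block))

    for v in range(n):
        if index[v] is None:
            visit(v)
    # Components are emitted after everything they reach; reversing the
    # emission order places every off-block nonzero above the block diagonal.
    return blocks[::-1]


# Nonzeros: row 0 -> {0,2}, row 1 -> {1}, row 2 -> {1,2,3}, row 3 -> {0,3}
adj = [[0, 2], [1], [2, 3, 1], [3, 0]]
blocks = block_upper_triangular_order(adj)
```

For this pattern the rows and columns 0, 2, 3 form one irreducible diagonal block, and the single coupling entry a_21 ends up above the block diagonal.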

The factorization and solution of the partitioned system can exploit the
above ordering in several ways. These include modifications to: a
standard sparse LU decomposition, a block LU decomposition, a decomposition
combined with a low rank update such as the Woodbury formula [9], and
iterative methods which may use the decomposition as a preconditioner.
For the moment attention is paid only to the modifications of the standard
sparse LU decomposition to exploit the form effectively and generate an
accurate solution.

The implementation of the solver first performs the H* ordering on a
single cluster and then distributes the matrix to the other clusters and places
certain portions of the border in the global memory to be shared by the
clusters. The clusters then compute the LU decomposition of each
diagonal block. This factorization can be done using dense techniques suitable
for the cluster's architecture (BLAS3-based partial pivoting) or a parallel
sparse factorization routine which exploits finer grain parallelism. Of course,
the diagonal blocks which result from H* may be singular or
ill-conditioned. This is handled by either artificially forcing nonsingularity or
by dynamically redefining the border during the factorization. In the latter
technique the unknowns which cause difficulty are cast into the border and
eliminated at a later stage. After these factorizations are completed the
off-diagonal entries in the upper triangular part of the matrix are updated.
The entries in the border are then updated. This update may exploit the
hierarchical structure of the border depending upon its size. Finally, the
diagonal block coupling the border elements is factored. This block is typically
dense, and its factorization may be performed on a single cluster or on multiple
clusters depending on its size. The forward solve is partitioned naturally as a
result of the factorization. However, during the factorization phase the U matrix
is redistributed to allow an efficient backward solve (which may be done repeatedly
if coupled with an iterative method).

3  Sparse  Basic  Linear  Algebra  Subroutines  for 
Iterative  Methods 

As was already mentioned in the introduction, identifying computational
primitives for sparse codes might not always be possible; for
iterative methods, however, these primitives are easy to identify. Recently many
papers have appeared which describe efficient implementations for triangular solves
and matrix-vector multiplies [1,6,15]. In this section a systematic way is
described to obtain efficient implementations for these computational
kernels on a vector/concurrent architecture.

Sparse computations are characterized by the intrinsic complexity of the
data handling. In order to cope with the complexity of data handling, the
design of sparse BLAS primitives should encompass both the data
manipulation capabilities of the architecture as well as the requirements imposed by
the sparse computation. This is achieved by essentially taking the following
four steps:

•  Defining  the  suitable  Data  Access  Types 

•  Handling  Data  Locality 

•  Handling  the  Irregularity  of  the  Computation 

•  Handling  Parallelism. 

We will demonstrate this for the Sparse Matrix A times Dense Matrix
(or Vector) B primitive (SpMxM(V)). The primitive SpMxM derives,
for instance, from an iterative method to obtain the eigenvectors of a sparse
matrix. At each iteration the iteration matrix is multiplied with the
approximations of the eigenvectors. The primitive SpMxV, which is a special case
of the former one, is the crucial component with respect to performance in
most iterative solvers.
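
As a point of reference, the SpMxV primitive can be sketched for one common condensed format. The compressed row storage used here (arrays `rowptr`, `colind`, `values`) is an assumption made for illustration; the text does not prescribe a particular format at this point.

```python
def spmv_csr(rowptr, colind, values, x):
    """y = A*x for a matrix A held in compressed row storage."""
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            # x is reached through colind: indirect addressing
            acc += values[k] * x[colind[k]]
        y[i] = acc
    return y


# A = [[4,0,1],[0,3,0],[2,0,5]]
rowptr = [0, 2, 3, 5]
colind = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(rowptr, colind, values, [1.0, 2.0, 3.0])
```

Each stored nonzero contributes one multiply and one add but requires three loads (the value, its column index and the addressed element of x), which illustrates the low ratio of arithmetic operations to data references mentioned in the introduction.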

In vector/concurrent architectures, e.g. CEDAR, Alliant FX series, Cray
series, there are essentially two different types of data access: vector access
and scalar access. For the primitive SpMxM we can think of two possible
ways of realizing these vector accesses. One realization is based upon the
rows/columns of matrix A and/or B, and the other realization is obtained
by extending each row (column) of A to a full row (column) by shifting all
the non-zero entries of A to the top (to the right). The latter extended
rows (columns) are also called jagged diagonals, generalized columns and
stripes, see [6,12,14]. A represents the sparse matrix and B the dense matrix
throughout this section. The following table depicts which combinations of
these accesses make sense for the implementation of SpMxM:


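Access type of A (columns):  scalar | row | column | ext. row | ext. column
Access type of B (rows):     scalar | row | column
Combinations marked X:       five in total (B scalar: 2, B row: 1, B column: 2);
                             the column positions of the marks are not
                             recoverable from this copy.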
The scalar-A/scalar-B version is certainly implementable but does not
exploit the vector capabilities of the architecture under investigation. This
leaves us with four different types of implementation for SpMxM.
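
The extended-column (jagged diagonal) access can be sketched as follows. This is a minimal illustration in the spirit of the formats of [6,12,14], not a transcription of any of them: the rows are compressed to the left and sorted by decreasing number of non-zeros, so that the k-th non-zeros of all rows form one long vector. All names are hypothetical.

```python
def to_jagged(rows):
    """rows[i] is the list of (column, value) pairs of row i."""
    # stable sort: rows with more non-zeros come first
    perm = sorted(range(len(rows)), key=lambda i: -len(rows[i]))
    width = len(rows[perm[0]]) if rows else 0
    jds = []
    for k in range(width):
        # k-th non-zero of every row that has one; these rows form a prefix of perm
        jds.append([rows[i][k] for i in perm if len(rows[i]) > k])
    return perm, jds


def spmv_jagged(perm, jds, x):
    """y = A*x, one long vector operation per jagged diagonal."""
    y_perm = [0.0] * len(perm)
    for diag in jds:
        for r, (col, val) in enumerate(diag):
            y_perm[r] += val * x[col]
    y = [0.0] * len(perm)
    for r, i in enumerate(perm):
        y[i] = y_perm[r]  # undo the row permutation
    return y


# A = [[4,0,1],[0,3,0],[2,0,5]]
rows = [[(0, 4.0), (2, 1.0)], [(1, 3.0)], [(0, 2.0), (2, 5.0)]]
perm, jds = to_jagged(rows)
y = spmv_jagged(perm, jds, [1.0, 2.0, 3.0])
```

The vector length of each operation is now the number of rows that still have a non-zero in position k, rather than the (typically short) length of a single sparse row.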

For multiprocessors with a hierarchically structured memory system care
has to be taken with respect to data locality. By this we mean the ability to keep
data in the highest level of a system, e.g., vector registers or cache, for as
long as possible during a computation. The data locality of a computation
is largely determined by the time delay between re-usage of data. The
following strategy is used to optimize data handling in these architectures.

1.  Vector  Register  utilization:  reduction  of  the  number  of  data  streams 
to  be  accessed  by  each  CE. 

2.  Cache  utilization:  reduction  of  the  length  of  the  data-streams. 

The reduction of the number of data streams is obtained by keeping each
operand as long as possible in a vector register during the computations. For
each of the above mentioned four versions this gives rise to two variants.
For instance, for the scalar-A/row-B version, either each row of A can be
kept in a vector register for as long as possible, or each row of the result
matrix. The number of data streams can be decreased even further by
applying a blocking technique. By blocking we mean that the innermost
loop of a nested loop is not iterated for a maximal number of times, but
only in chunks of a certain length. The following table shows the reduction
of the number of data streams for each of the eight versions.


[Table: number of data streams for each version (1A, 1B, 2A, 2B, 3A, 3B, 4A, 4B),
before blocking and after blocking. The numeric entries are illegible in this copy.]

The entries in this table which are suffixed by a * indicate a non-optimal
reduction of the data streams caused by the occurrence of indirect
addressing.

The length of the data streams can be reduced by again applying a
blocking technique. This blocking technique is applied in the same way
as the one mentioned above, but its functionality is quite different. Whereas
both blocking techniques try to decompose a DO-loop in chunks of a certain
length, the first mentioned blocking is constrained to the vector processing
capabilities of an architecture. In order to distinguish the two forms of
blocking we call the latter one vertical blocking.
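
Blocking in the first sense (chunks matched to the vector registers) can be sketched for a scalar-A/row-B variant of SpMxM in which a strip of the result row is kept in a register while the non-zeros of a row of A are processed. The compressed row storage, the parameter `vl` (standing in for the vector register length) and the function name are assumptions made for illustration.

```python
def spmm_blocked(rowptr, colind, values, B, vl=2):
    """C = A*B with A sparse (compressed rows) and B dense, strip by strip."""
    n_rows = len(rowptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for start in range(0, n_cols, vl):  # chunks of at most vl columns
            stop = min(start + vl, n_cols)
            acc = [0.0] * (stop - start)    # stands in for one vector register
            for k in range(rowptr[i], rowptr[i + 1]):
                a, row_b = values[k], B[colind[k]]
                for c in range(start, stop):
                    acc[c - start] += a * row_b[c]
            C[i][start:stop] = acc          # one store per strip of the result
    return C


# A = [[4,0,1],[0,3,0],[2,0,5]], B dense 3 x 3
rowptr = [0, 2, 3, 5]
colind = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
C = spmm_blocked(rowptr, colind, values, B, vl=2)
```

Within a strip the result is loaded and stored once, so only the rows of B remain as a memory stream in the inner loop; the strip length changes the loop structure but not the computed values.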

The  use  of  condensed  storage  formats  for  the  sparse  matrix  can  have 
different  impacts  on  the  8  implementations.  First  indirect  addressing  can 
be  caused  in  the  innermost  loop  body.  Secondly  the  loop  boundaries  can 
become  dependent  on  the  structure  of  the  matrix,  and,  thirdly,  the  length 
of  the  vector  operations  may  be  affected.  If  we  summarize  all  the  effects 
of  using  a  condensed  storage  format  for  the  sparse  matrix,  we  obtain  the 
following  table: 


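Version:          1A  1B  2A  2B  3A  3B  4A  4B
Effect
Ind. Addr.        marked for five versions
Inner Lp. Bnd.    marked for two versions
Outer Lp. Bnd.    marked for one version
Vector Length     marked for four versions
(The column positions of the marks are not recoverable from this copy.)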

For a more detailed account of these issues, and, specifically, the
handling of parallelization and experimental data on the implementation of the
primitives SpMxM(V), the reader is referred to [18].

In conclusion, sparse BLAS implementations can speed
up the performance of sparse computations considerably. In the case of
the Alliant FX/8 (FX/80), on which the performance of unstructured sparse
computations lies within the range of 0.1-10 Mflops, specialized higher order
BLAS routines can speed up this performance to 5-30 (8-50) Mflops.
Abstract

Modern supercomputers like the CRAY X-MP and CRAY
Y-MP achieve their high computing speed by using both
vector and parallel hardware. This paper gives a short
introduction to the CRAY Y-MP system architecture and
describes the multitasking concepts which can be used on
this machine. There are three different concepts which
support parallelism on the programming language level:
macrotasking, microtasking, and autotasking. Macrotasking
supports coarse-grain parallelism on the level of
subroutines. With microtasking, fine-grain parallelism can be
used even on the level of DO loops. In contrast to these
concepts, where the multitasking primitives have to be
introduced by hand, the new autotasking concept offers an
automatic way of finding parallelism in existing
FORTRAN programs. The concepts and implementations
are discussed, and measurements of the overhead as
well as performance results for kernels and an application
program are presented.

Moreover, the overall system performance is of interest
when multitasking concepts are used. Therefore, a
programming system is developed, generating synthetic user
programs which simulate a given work load in a flexible
way. The resulting benchmark environment is used to
insert additional sequential as well as parallel programs. This
technique guarantees constant system load and enables
reasonable comparisons. First results obtained from these
investigations have proved the efficiency especially for the
fine-grain concepts, which provide good performance in
dedicated as well as batch oriented multiprogramming
environments; for selected production codes on the CRAY
Y-MP these concepts are now used to evaluate their effects
on a loaded system.

Keywords. CRAY Y-MP, multitasking, macrotasking,
microtasking, autotasking, parallel programming, linear
algebra kernels, benchmark environment, hardware
performance monitor.


1.  Introduction 

The CRAY X-MP and CRAY Y-MP are multiprocessor
systems with a shared memory. A short introduction to
the system architecture is given in section 2. On these
vector-supercomputers, parallelism on the programming
language level is handled by three modes of multitasking.
Macrotasking supports parallelism on the subroutine level
([7,13,18]). Task creation, synchronization, and
communication are specified explicitly by the programmer using
subroutine calls. Macrotasking exploits the intrinsic
parallelism of a problem by partitioning the computations
into N tasks which are simultaneously executed on N
processors. Inefficiencies arise when the tasks are not well
balanced or synchronization is needed too often ([4]).

The second strategy is called microtasking and works on
the statement level ([7,12]). It makes use of compiler
directives inserted by the programmer. These directives are
passed to a preprocessor which generates subroutine calls
for the creation of parallel tasks and their synchronization.
In contrast to macrotasking, the program parallelism is
dynamically mapped to the number of CPUs available at
run time.
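
The difference between a fixed partition into N tasks and a dynamic mapping onto the CPUs that happen to be available can be illustrated by a generic self-scheduling loop. The sketch uses Python threads purely as an analogy; it does not show CRAY directives or library calls, and the names `run_loop` and `squares` are hypothetical.

```python
import threading


def run_loop(n_iterations, body, n_workers):
    """Each worker repeatedly grabs the next unclaimed loop iteration."""
    next_iter = 0
    guard = threading.Lock()

    def worker():
        nonlocal next_iter
        while True:
            with guard:  # claim one iteration atomically
                i = next_iter
                next_iter += 1
            if i >= n_iterations:
                return
            body(i)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def squares(n, n_workers):
    out = [0] * n

    def body(i):
        out[i] = i * i  # iterations are independent of each other

    run_loop(n, body, n_workers)
    return out
```

The loop is written without reference to the number of workers, so the same code gives the same result with one worker or with several, which is the property the dynamic mapping relies on.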

The third strategy, which has been implemented recently,
is called autotasking ([1,10,17,23]). It is based on
improved dependency analysis techniques which provide an
automatic mechanism for detecting parallel regions of code
(normally DO loops) without user intervention. To
improve the performance, this process may be supported by
additional informational preprocessor directives specified
by the programmer. In comparison with microtasking,
functionality is improved. The parallel primitives are
slightly changed, whereas the synchronization techniques
used have already proved to be efficient in the
microtasking implementation. In contrast to microtasking,
autotasking supports the flexible definition of parallel
regions at any place in a subroutine; for each of the parallel
regions the data scope is analyzed and can be defined
explicitly. These multitasking concepts available on CRAY
machines are described in detail in section 3.


An evaluation of multitasking concepts should also take
into account runtime measurements. There are several
ways to assess the efficiency of multitasking
implementations (see also Fig. 1):

1) Overhead measurements of multitasking primitives:
The execution of multitasking primitives causes
overhead due to the runtime used by the primitives themselves.
Section 3 provides a set of overhead timings for the
main multitasking primitives.

2) Performance measurements of program kernels:
In order to explore the strength of the multitasking
implementations for kernels often used in scientific
programs, parallel algorithms for linear algebra
problems ([12,21]) have been implemented and extensively
studied on the CRAY multiprocessor systems. A
short summary of the results can be found in
section 4.

3) Performance measurements of application programs:
Section 5 presents results for a large application
program implemented on the CRAY systems using
multiple CPUs. Moreover, this section describes program
modifications which are useful to improve load
balancing and to exploit the total parallelism.

4) Measurements of the overall system performance:
To analyze the efficiency of multitasking concepts with
respect to the total computer system, benchmark
programs are used to generate a well-defined work
load. The generation and selection of such benchmark
programs is fundamental for the performance
evaluation. Section 6 describes a program system
developed by the author which generates benchmark
programs capable of simulating any given user load in a
flexible way. This system is used on a CRAY
multiprocessor system to assess the efficiency of the
multitasking concepts with respect to both user programs
and operating system activities.


/  Performance 


Application  Programs 

Fig.  1.  Evaluation  of  multitasking  concepts  with  re¬ 
spect  to  four  areas 

2. The CRAY Y-MP Multiprocessor System

The CRAY Y-MP is a shared memory multiprocessor
system with up to 8 processors, and it is the successor
of the CRAY X-MP. A description of the CRAY
X-MP system can be found in [12]; as far as the logical
structure is concerned, both CPU types are identical. Each
CRAY Y-MP CPU is a high-speed vector processor with
specialized pipelined functional units which can be utilized
in parallel to perform high-speed (floating point) computations.
The processors have a cycle time of 6 nsec. The
main memory consists of up to 64 megawords and is realized
in ECL technology providing an access time of
30 nsec. Compared with the CRAY X-MP, the memory organization
has been significantly improved; a description
of the differences can be found in [8]. The detailed system
characteristics of the two CRAY multiprocessor systems
installed at KFA are presented in Tab. 1.

                          CRAY X-MP/416     CRAY Y-MP8/832

number of CPUs            4                 8
CPU cycle time            8.5 nsec          6 nsec
number of functional
units per CPU             13                13
main memory               16 MW             32 MW
                          (organized in     (organized in
                          64 banks)         256 banks)
memory access time        34 nsec           30 nsec
Solid-state Storage
Device (SSD)              32 MW             128 MW

Tab. 1. System characteristics of the CRAY multiprocessors
installed at KFA

The CPUs of one CRAY Y-MP system are tightly coupled
via the shared main memory and 9 identical groups of
registers, called clusters, each containing 8 shared address
registers (SB), 8 shared scalar registers (ST), and 32 binary
semaphore registers (SM). These registers can be accessed
by all processors; depending on program requirements,
either one cluster, multiple clusters, or possibly
no cluster will be attached to a CPU. These hardware
features provide the basis for high-speed communication
between parallel tasks when multitasking is used.

3.  Multitasking  Strategies 

The multitasking concepts provided by CRAY Research
Inc. support the exploitation of parallelism on different
programming levels. They are realized by means of subprogram
libraries, which are linked together into the library
UTLIB, and they allow the execution of one program in
parallel using tasks. In multitasking terminology, a task is
a synonym for a user library task. User library tasks are
generated by calls to the multitasking library subroutines.
Each multitasking program is handled by the library
scheduler which belongs to the multitasking library. The
library scheduler is the interface between the multitasking
program and the operating system. It handles the user library
tasks and connects them to exchange packages.
Afterwards, the job scheduler will find all necessary information
in the exchange package in order to attach a physical
CPU to the task.

For  a  program  which  does  not  use  multitasking  at  all,  the 
master  task  is  executed.  Within  a  program  which  uses 
multitasking,  the  master  task  will  create  further  tasks  by 
calling  library  subprograms.  These  new  tasks  are  executed 
in  parallel  with  the  master  task. 

A subprogram may be called by more than one task
simultaneously. Thus, it is necessary to guarantee that each
task can access local variables without conflicts. Therefore,
all local variables of the subprogram must be stored within
the local memory area of each task (stack); the cft77
compiler enables this feature. Subprograms compiled with this
option are called reentrant.

If a non-reentrant subprogram is to be executed by multiple
tasks in parallel, either a copy of this subprogram
with a different name must be used for each task, or the
subprogram called must be embedded into a critical region. In
order to guarantee correct results when manipulating variables
and data elements within a critical region, this region
must not be entered by more than one task at a time.
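The critical-region rule above can be illustrated with a small sketch. It is written in Python rather than CRAY FORTRAN, and the names `counter`, `region`, `update`, and `run` are hypothetical; the sketch only mirrors the idea that a shared update is entered by one task at a time.

```python
import threading

counter = 0                    # shared data, visible to all tasks
region = threading.Lock()      # guards the critical region (hypothetical name)


def update(n_updates):
    """Update shared data; the read-modify-write sits in a critical region."""
    global counter
    for _ in range(n_updates):
        with region:           # only one task may be inside at a time
            counter += 1


def run(n_tasks=4, n_updates=1000):
    """Start n_tasks tasks that all call the same subprogram, then wait."""
    global counter
    counter = 0
    tasks = [threading.Thread(target=update, args=(n_updates,))
             for _ in range(n_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return counter
```

Local variables such as the loop index live on each thread's own stack, which corresponds to the reentrant case; only the shared counter needs the critical region.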


3.1.  The  Macrotasking  Concept 

Macrotasking is the kind of multitasking where parallelism
is realized at subprogram level. Within a program, subprograms
may be executed as different tasks. In order to use
macrotasking, the application has to be partitioned into a
fixed number of tasks; the user has to create the tasks explicitly
by subprogram calls. Task management is done
by the library scheduler. The library scheduler gets information
about synchronization and communication requirements
via the subprogram calls. It controls special
queues and initiates the necessary activities. The job
scheduler attaches the tasks to physically available CPUs
([7,13]). These tasks can be executed correctly in parallel
if there are no prohibitive mutual dependencies concerning
synchronization and sequencing.

The macrotasking library consists of four different parts,
which can be characterized by different functions of the
embedded routines:

1) Routines to manipulate tasks:

•  TSKSTART(tskarray, name[, list])

generates a new task with identification tskarray
for subroutine name and passes the parameters
of list to the subroutine.

•  TSKWAIT(tskarray)

waits for the end of the task identified by
tskarray.

2) Routines to control events¹:

•  EVASGN(name[, value])

declares the INTEGER variable name as event
variable.

•  EVWAIT(name)

waits for the event name.

•  EVPOST(name)

signals the event name to the scheduler.

3) Routines to control barriers²:

•  BARASGN(name[, value])

declares the INTEGER variable name as barrier
variable; value specifies the number of tasks
which have to synchronize.

•  BARSYNC(name)

signals to the scheduler that this task has arrived
at the barrier specified with name.

4) Routines to control critical regions:

•  LOCKASGN(name[, value])

declares the INTEGER variable name as lock
variable.

•  LOCKON(name)

signals to the scheduler that this task will enter
the critical region connected to the variable
name.

•  LOCKOFF(name)

signals to the scheduler that this task leaves the
critical region connected to the variable name.

¹ Events can be used to force a certain order of execution between tasks.

² Barriers are used to assure that all tasks have reached a particular program location.
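The roles of the four routine groups can be mimicked with standard thread primitives. The following is a minimal sketch in Python, not the CRAY library itself; the mapping in the comments (for example TSKSTART to `Thread.start`) is an analogy, and `run_tasks` and `worker` are hypothetical names.

```python
import threading


def run_tasks(n_tasks=4):
    """Start tasks, order them with an event, a barrier, and a lock."""
    barrier = threading.Barrier(n_tasks)   # role of BARASGN(name, n_tasks)
    ready = threading.Event()              # role of EVASGN(name)
    lock = threading.Lock()                # role of LOCKASGN(name)
    partial = [0] * n_tasks                # one result slot per task
    finished = []                          # shared list, updated under the lock

    def worker(task_id):
        ready.wait()                       # EVWAIT: wait until the master posts
        lo, hi = task_id * 10, (task_id + 1) * 10
        partial[task_id] = sum(range(lo, hi))
        barrier.wait()                     # BARSYNC: all tasks reached this point
        with lock:                         # LOCKON ... LOCKOFF
            finished.append(task_id)

    tasks = [threading.Thread(target=worker, args=(i,)) for i in range(n_tasks)]
    for t in tasks:
        t.start()                          # TSKSTART
    ready.set()                            # EVPOST
    for t in tasks:
        t.join()                           # TSKWAIT
    return sum(partial), sorted(finished)
```

The number of tasks is fixed when the barrier is created, which reflects the fixed task count that macrotasking requires.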

Multitasking primitive timings as measured on CRAY
Y-MP can be found in Tab. 2 at the end of section 3.
For the generation and termination of tasks some additional
work is needed to allocate storage locations. The
basic overhead for generating a new task via a
TSKSTART call is about 62 microseconds. The synchronization
of the data access to global variables (i.e. for
reading or updating) can be done using events, locks, and
barriers. The overhead for an event post³ - event wait is
about 16.1 microseconds; the overhead for a lock set - lock
free is about 5.5 microseconds. The generation of lock and
event variables as well as the release of such variables takes
from 2 to 3 microseconds. A barrier synchronization
of two tasks costs about 13.6 microseconds, the
assignment of the barrier variable about 6.5 microseconds,
and the release of this variable about 3.6 microseconds.

For certain kinds of programs, macrotasking leads to efficient
use of the multiprocessor machine. But often the user
has to deal with three types of problems which may reduce
the speedup achievable by this multitasking strategy:

1) For small granularity, the system overhead is too large
due to the high synchronization frequency involved.

2) Some of the tasks cannot be executed in a balanced
way on the multiple CPUs.

3) The number of processors does not match the fixed
number of tasks specified within the macrotasking
program.

Several manufacturers of multiple-CPU systems like
CRAY and IBM are promoting developments in the field
of parallel programming, in particular to reduce synchronization
overhead within multitasking. Microtasking is an
approach that allows the efficient use of multiple processors
even for small granularity ([12,18]).

3.2. The Microtasking Concept

Microtasking provides parallelization on the statement
level, thus providing a different interface to multitasking.
Statements, sequences of statements, and complete subprograms
may be executed in parallel on multiple processors
of a CRAY Y-MP machine, for instance, when
microtasking is used.

³ One task is still waiting on the event which is posted.

Instead of using a scheduler, all communication and synchronization
is done by accessing shared registers directly
([7,14]). The overhead of the communication via registers
is much smaller than the overhead to access corresponding
variables in main memory as realized within the macrotasking
library routines.

The microtasking features are provided by preprocessor
directives ([7,14]). These directives are coded into the user
program; this leads to multitasking programs which are
easier to understand than in the macrotasking version. The
usage of directives does not affect the portability of programs:
because the directives must be coded with a 'C' in
the first column, all FORTRAN compilers will interpret
such statements as comment statements.

The microtasking strategy is based on the preprocessor
PREMULT ([7,14]), which takes the user program as input
and generates three separate subroutines for each subroutine
which is to employ microtasking:

1)  Master  routine: 

The  master  routine  is  a  program  coded  in  assembler 
language,  which  is  called  by  the  same  name  as  the 
original  program.  This  routine  decides  whether  the 
parallel  version  of  a  subprogram  will  be  used  or  not. 

2)  SNGL  version: 

The  SNGL  version  contains  FORTRAN  program 
code  for  the  single-task  execution  of  the  program. 

3)  MULT  version: 

The  MULT  version  is  a  modified  FORTRAN  version 
of  the  subprogram  to  be  executed  in  parallel. 

Within the MULT version PREMULT translates directives
into corresponding library calls. There are several
directives available to the user; they can be classified according
to their usage:

1) Demand and return of physical CPUs:

•  CMIC$ GETCPUS n

declares the maximum number of CPUs the
program can use.

•  CMIC$ RELCPUS

returns the requested CPUs back to the operating
system.

2) Definition of control structures:

•  CMIC$ DO GLOBAL

marks a DO loop to be executed in parallel if
more than one CPU is available.

•  CMIC$ PROCESS

marks the beginning of a program part which
must be executed by one processor.

•  CMIC$ ALSO PROCESS

marks the beginning of a further program part
which could be executed in parallel to other
processes. This directive may be used several
times.

•  CMIC$ END PROCESS

marks the end of a PROCESS part.

•  CMIC$ STOP ALL PROCESS

provides a way to stop parallel execution before
all computations are finished.

3) Safeguarding of critical regions:

•  CMIC$ GUARD n

marks the beginning of the critical region
named n.

•  CMIC$ END GUARD n

marks the end of the critical region named n.

The safeguard directives are also translated within the
SNGL version because exclusive access to variables
also must be guaranteed for sequential subroutines
which are called by microtasking tasks.

Contrary to macrotasking, microtasking parallelism is
specified by the definition of control structures. A microtasking
control structure declares a part of a program
which must be finished before a code part outside of this
region may be executed. Within such a control structure
processes are defined. A microtasking process is a sequence
of instructions which always is executed as a single task.
Microtasking distributes such processes dynamically to the
actually available CPUs. Program segments outside of
microtasking control structures are executed by all tasks in
an unpredictable sequence.
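The dynamic distribution of processes to available CPUs can be sketched as a self-scheduled loop: each task repeatedly fetches the next unclaimed iteration until none is left. This Python sketch only illustrates the idea; it uses a lock-protected counter where the CRAY implementation uses shared registers, and `parallel_do` is a hypothetical name.

```python
import threading


def parallel_do(n_iterations, body, n_tasks=4):
    """Run body(i) for i in 0..n_iterations-1, handing iterations out dynamically."""
    next_index = 0
    guard = threading.Lock()
    executed_by = [None] * n_iterations    # which task ran which iteration

    def task(task_id):
        nonlocal next_index
        while True:
            with guard:                    # short critical region: claim an iteration
                i = next_index
                next_index += 1
            if i >= n_iterations:
                return                     # no work left for this task
            body(i)
            executed_by[i] = task_id

    tasks = [threading.Thread(target=task, args=(t,)) for t in range(n_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()                           # synchronization at the end of the structure
    return executed_by
```

Because iterations are claimed at run time, the same code works for any number of tasks, which is the property that distinguishes this scheme from the fixed partitioning used with macrotasking.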

The dispatch and wait overhead for scheduling a microtasking
subroutine is low and costs about 2.8 microseconds
on a CRAY Y-MP. The overhead of executing a parallel
loop instead of a normal DO loop is 0.450 * ch + 2.6
microseconds (ch is the number of chunks; each chunk
contains a definite number of loop iterations), which corresponds
to 75 * ch + 433 CPU cycles. Executing several
PROCESS directives costs about 1.0 * pd + 2.3 microseconds
(pd is the number of PROCESS directives),
and locking a critical section using the GUARD directives
causes an overhead of 0.8 microseconds.
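The correspondence between microseconds and CPU cycles follows from the 6 nsec cycle time of the CRAY Y-MP. A small Python sketch of this arithmetic is given below; the function names are hypothetical, and the coefficients are the ones quoted in the text.

```python
CYCLE_NS = 6.0  # CRAY Y-MP CPU cycle time in nanoseconds (from Tab. 1)


def microseconds_to_cycles(us, cycle_ns=CYCLE_NS):
    """Convert a time in microseconds to a rounded number of CPU cycles."""
    return round(us * 1000.0 / cycle_ns)


def microtasking_loop_overhead_us(ch):
    """Overhead of a microtasked DO loop with ch chunks, in microseconds."""
    return 0.450 * ch + 2.6
```

For example, `microseconds_to_cycles(0.450)` and `microseconds_to_cycles(2.6)` reproduce the per-chunk and constant terms of the cycle formula.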


Microtasking has shown that its synchronization
primitives are highly efficient, but the user still has to introduce
parallelism manually by inserting preprocessor directives.

3.3.  The  Autotasking  Concept 

The main goal of the CRAY autotasking concept is to
provide an efficient and automatic mechanism to
parallelize programs on CRAY multiprocessors (see
[1,23]). Autotasking integrates the microtasking concept:
most of the communication and synchronization is done
also by accessing shared semaphore registers directly. Additional
tasks execute a Wait-on-Semaphore operation on
a shared register. If there is any parallel work to do, the
semaphore is reset, and all tasks are able to execute parts
of the work. CRAY autotasking will support fine-grain
parallelism on all CRAY machines running UNICOS⁴.

In contrast to microtasking, fine-grain parallelization with
autotasking is done by the automatic parallelization of DO
loops. The features are provided by the compiling system
cf77 which can be partitioned into three parts: fpp, fmp,
and cft77. The preprocessor fpp analyzes FORTRAN
programs; it is able to detect DO loops executable in parallel,
and it marks these loops with preprocessor directives.
These directives are translated by the fmp preprocessor,
which is the second part of the cf77 compiling system, and
the cft77 compiler generates machine code for the modified
FORTRAN program.

The autotasking primitives represent a superset of the
microtasking primitives with several extensions improving
functionality. With microtasking, a parallel region has to
begin at the top of a subroutine. This has been changed
significantly: autotasking supports flexible definition of
parallel regions at any place of a subroutine. The preprocessor
fmp generates a set of subroutines for each of the
parallel regions, and all available CPUs will enter at the
beginning of these subroutines. Implicit synchronization
takes place at the bottom of the code marked by the
DO PARALLEL, the END CASE, or the DO ALL directives.
For each of the parallel regions the data scope of
the variables (SHARED or PRIVATE) is analyzed and
can be defined explicitly.

In comparison with microtasking, only the directives which
define multitasking control structures are changed; the directives
for CPU handling and safeguarding remain unchanged.
The following list gives a short description of the
new autotasking primitives:

•  CMIC$ PARALLEL

marks the beginning of a parallel region where multiple
CPUs may enter. The parallel processes within
these parallel regions have to be marked with
DO PARALLEL or CASE directives.

•  CMIC$ END PARALLEL

marks the end of a parallel region.

•  CMIC$ DO PARALLEL

marks a DO loop to be executed in parallel if more
than one CPU is available within a parallel region.

•  CMIC$ DO ALL

defines a separate parallel region for the following DO
loop only. In addition, this DO loop is marked to be
executable in parallel by more than one CPU.

•  CMIC$ CASE

marks the beginning of a program part which could
be executed in parallel to other processes. The directive
may be used several times.

•  CMIC$ END CASE

marks the end of a CASE part.

•  CMIC$ SOFT EXIT

provides a way to stop parallel execution before all
computations are finished.

⁴ An autotasking version for the COS operating system is being released just now.

Several directive options like threshold tests, flexible definition
of chunk sizes (including the guided self scheduling
approach [22]), automatic partitioning of long vector
loops etc. have been introduced to optimize execution of
parallel regions. Furthermore, there are some minor syntax
changes. Compared with microtasking under the COS
operating system, the synchronization speed is drastically
increased. Instead of calling a function as within microtasking,
autotasking uses inline code for synchronization.
The overhead of executing a parallel loop instead of a
normal DO loop (see also [23]) is about 0.150 * ch + 1.1
microseconds on a CRAY Y-MP (ch is the number of
chunks) which corresponds to 25 * ch + 183 CPU cycles.
A critical section can be locked using the GUARD directives
leading to 0.95 microseconds overhead.
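To make the relation between chunk sizes and loop overhead concrete, the following Python sketch generates chunk sizes with the textbook guided self-scheduling rule (each chunk takes the remaining iterations divided by the number of CPUs, rounded up). This rule is an assumption taken from the general literature; the text does not specify how the CRAY implementation chooses its chunks. The function names are hypothetical.

```python
import math


def guided_chunks(n_iterations, n_cpus):
    """Chunk sizes under the textbook guided self-scheduling rule (assumed)."""
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = math.ceil(remaining / n_cpus)   # shrinking chunks toward the end
        chunks.append(size)
        remaining -= size
    return chunks


def autotasking_loop_overhead_us(ch):
    """Overhead of an autotasked DO loop with ch chunks, in microseconds."""
    return 0.150 * ch + 1.1
```

For a loop of 25 iterations on 4 CPUs this rule produces nine chunks, so the formula quoted above gives an overhead of roughly 2.45 microseconds for the whole loop.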

Autotasking can coexist with macrotasking and microtasking
within one program system, but autotasking and
microtasking are not allowed to be used together within
one subroutine. At the moment, nested parallelism (i.e.
parallel loops inside a parallel loop) is not supported by
autotasking.

The basic overhead of all multitasking concepts is summarized
in Tab. 2. Owing to the actual implementation,
the macrotasking and microtasking measurements are done
under the COS Rel. 1.17 operating system whereas
autotasking is measured under UNICOS 5.0.


Multitasking             CRAY Multitasking Concepts
primitive          Macrotasking   Microtasking      Autotasking
                   (COS)          (COS)             (UNICOS)

Dispatch-Wait      62             2.8               2.5
Parallel Loop      -              0.450*ch + 2.6    0.150*ch + 1.1
Parallel Case      -              1.0*pd + 2.3      0.2*pd + 1.9
Event Post-Wait    16.1           -                 -
Lock Set-Free      5.5            0.8               0.95
Barrier
Synchronization    13.6           -                 -

Tab. 2. Overhead caused by multitasking
primitives: All times are measured in
microseconds on a CRAY Y-MP8/832. ch
is the number of chunks, and pd stands for
the number of process resp. case directives.

4.  Linear  Algebra  Kernels 

Using multitasking to take advantage of parallelism within
a program gives the chance to speed up special programs
appreciably. This is especially true for linear algebra kernels
which are heavily used in large application packages.
Moreover, multitasking strategies and corresponding
problems can be studied in detail by executing such programs.

As matrix multiplication has a simple structure, this algorithm
is often used to evaluate the different multitasking
strategies and to document the potential benefit for such
easy-to-use algorithms. There exist several algorithms and
implementations for the matrix multiplication; the subroutine
MXV from SCILIB (CRAY subroutine library)
can be used, for example, to perform the matrix-vector
operations (BLAS 2). The usage of microtasking is introduced
by marking the DO loop with a DO GLOBAL
directive (see [12]). This program is often used within the
literature as an example to document the effectiveness of
microtasking; there, microtasking is used in a natural and
easy way. Fig. 2 shows the speedup⁵ obtained for the library
routine MXV using microtasking and autotasking as
well as the macrotasking version of a parallelized MXM
subroutine from SCILIB (BLAS 3). It can be seen that
for a vector length larger than 128 neither overhead nor
memory organization problems are of significant influence,
and the total performance is quite satisfying.

Fig. 2. Matrix multiplication using BLAS 2 and
BLAS 3 library routines with multitasking

Within user programs, three nested DO loops are normally
used as an algorithm for the matrix multiplication instead
of library routines. This implementation makes it possible
to work with a portable code running on most of the
available machines, and microtasking can be specified in
an easy way by marking the outer loop with a
DO GLOBAL directive. J.J. Dongarra has introduced
the unrolling technique ([9]) for single-task vectorizing
programs, in particular to remove memory conflicts. Using
8-way unrolling, the memory-section and bank conflicts
([12,20]) are significantly reduced.
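The unrolling transformation itself can be shown on a column-oriented matrix-vector product. The sketch below is in Python and uses 4-way instead of 8-way unrolling to stay short; it illustrates only the code transformation, not the memory-conflict behavior of the CRAY hardware, and the function names are hypothetical.

```python
def matvec_plain(a, x):
    """y = A x with the outer loop over columns, one column per pass."""
    m, n = len(a), len(x)
    y = [0] * m
    for j in range(n):
        for i in range(m):                 # inner loop: the vectorizable one
            y[i] += a[i][j] * x[j]
    return y


def matvec_unrolled4(a, x):
    """Same product with the column loop unrolled 4 ways."""
    m, n = len(a), len(x)
    y = [0] * m
    j_clean = n - n % 4                    # columns handled by the unrolled loop
    for j in range(0, j_clean, 4):
        for i in range(m):                 # y[i] is loaded and stored once per 4 columns
            y[i] += (a[i][j] * x[j] + a[i][j + 1] * x[j + 1]
                     + a[i][j + 2] * x[j + 2] + a[i][j + 3] * x[j + 3])
    for j in range(j_clean, n):            # clean-up loop for leftover columns
        for i in range(m):
            y[i] += a[i][j] * x[j]
    return y
```

In the unrolled version each pass over the result vector combines several columns, so the result vector is read and written less often than in the plain version.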

Fig. 3 shows the speedup obtained for both versions of the
DO loop algorithm using microtasking, which documents
the benefit from unrolling in combination with microtasking
(see also [12]).

In  addition  to  matrix  multiplication,  some  linear  algebra 
algorithms  like  LU  and  Cholesky  decomposition  heavily 
used  in  application  programs  within  scientific  and  technical 
computing  environments  are  studied.  Results  for  these 
kernels  can  be  found  in  ([12,13]). 


5.  Parallel  Application  Program 

To get deeper insight into the multitasking implementations,
the concepts were used to parallelize a numerical
simulation program which seems to be typical for a wide
range of applications running at our site. Numerical simulation
is used more and more, for example, to get detailed
information about crystal growth processes. In the
Czochralski crystal growth from electrically conductive
melts, the application of an external magnetic field has become
a very useful technique for improving crystal quality.
The simulation leads to a system of coupled partial differential
equations, i.e. incompressible Navier-Stokes and
convective heat equations, which have to be solved. Using
a three-dimensional mathematical model of the Czochralski
crystal growth, the simulation provides information about
the qualitative and quantitative difference between the
influence of a stationary transverse magnetic field and that
of a vertical magnetic field on the flow and temperature
distribution in the silicon melt, for instance. A
more detailed description of the simulation program can
be found in [12,15,19].

Fig. 3. Matrix multiplication using DO loops and
microtasking

This program is used to carry out parameter studies to get
detailed information about the Czochralski bulk flow. The
sequential version of the simulation program is highly optimized
for vectorization, running after some code modifications
with a speed of about 195 MFLOPS on one
processor of a CRAY Y-MP. The CPU time needed to
solve one of these problems varies from 10 to 30 CPU
hours on one processor of a CRAY Y-MP, depending on
material and geometrical parameters and the considered
time interval. The simulation program has a simple structure,
and using FLOWTRACE ([6]) it was obvious that
after a short time of initialization more than 99% of the
total CPU time is spent in three subroutines. Based on this
information further examinations could focus on these
subroutines.

⁵ Time measurements are done by calls to IRTC on a dedicated CRAY Y-MP8/832. The operating system level was
COS 1.17 BF 1 resp. UNICOS 5.0. The speedup is calculated as the ratio of corresponding times. For kernel measurements
always the minimum of three executions is taken as the time result to remove the additional work for the first
TSKSTART calls.

Exploiting parallelism with the macrotasking concept on
the CRAY X-MP, for each of these subroutines four independent
tasks are generated with the TSKSTART
primitive. These tasks are synchronized with the new
barrier synchronization primitive introduced with the COS
1.17 operating system. Most of the computational work is
done in three-fold nested loops: outer I-loop (running from
2 to 26), then the K-loop (running from 2 to 34), and the
inner J-loop (running from 2 to 92). An example of a typical
loop nest can be found in Fig. 4.

      DO 20008 I=2,LM1
        RFTP=RFI(I)*DPHII*SI
        DO 151 K=2,NM1
        DO 151 J=2,M
          CR(I,J,K)=CR(I,J,K)+SI*DRI*
     >      (DFF(I,J,K)-DFF(I+1,J,K))
          CF(I,J,K)=CF(I,J,K)+RFTP*
     >      (DFF(I,J,K)-DFF(I,J+1,K))
          CZ(I,J,K)=CZ(I,J,K)+SI*DZI*
     >      (DFF(I,J,K)-DFF(I,J,K+1))
  151   CONTINUE
20008 CONTINUE

Fig.  4.  Typical  loop  nest 

The cft77 compiler vectorizes only on the innermost DO
loop. Therefore, efficient vectorization is achieved by running
the longest loop in vector mode. Using macrotasking,
the outer I-loops are partitioned into as many parts as
CPUs are available, and each task takes its own work
based on a special identifier associated with this task (see
Fig. 5).

The microtasking parallelism is introduced by several
DO GLOBAL directives which specify independent I-loop
iterations. Because the microtasking concept
guarantees explicit synchronization of all active tasks at the
bottom of a microtasking control structure (MCS), no additional
synchronization is needed.


      SUBROUTINE VELO(ID)
C     declarations
      COMMON/NOTASKS/NTSKS
      COMMON/EVENTS/EVENT1
C     partition DO loop, starting at 2,
C     ending at LM1
      I1=2
      I2=LM1
      ITER=(I2-I1+1)/NTSKS
      ILAST=(I2-I1+1)-ITER*NTSKS
      II1=I1+(ID-1)*ITER+MIN(ID-1,ILAST)
      II2=I1+(ID  )*ITER+MIN(ID,ILAST)-1
C     ...
      DO 111 I=II1,II2
      DO 111 K=2,NM1
      DO 111 J=2,M
C     ...
  111 CONTINUE
C     ...
      CALL SYNCH(ID,EVENT1,NTSKS)
C     ...
      END

Fig.  5.  Macrotasking  version  of  subroutine  VELO 
running  with  identifier  ID 
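The index arithmetic of Fig. 5 can be checked outside of FORTRAN. The following Python sketch transcribes the partitioning formulas one to one (integer division replaces FORTRAN integer arithmetic); the function name `partition` is hypothetical, and task identifiers run from 1 to `ntsks` as in the figure.

```python
def partition(i1, i2, ntsks, task_id):
    """Return the inclusive index range (ii1, ii2) handled by task task_id."""
    n = i2 - i1 + 1                    # total number of loop iterations
    iter_ = n // ntsks                 # base number of iterations per task
    ilast = n - iter_ * ntsks          # leftover iterations, spread over the first tasks
    ii1 = i1 + (task_id - 1) * iter_ + min(task_id - 1, ilast)
    ii2 = i1 + task_id * iter_ + min(task_id, ilast) - 1
    return ii1, ii2
```

For the I-loop running from 2 to 26 and four tasks, the first task receives seven iterations and the remaining three tasks receive six each, so the ranges are contiguous and cover the loop exactly once.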

The time⁶ is measured to check the results for a time interval
of the simulation process which represents a typical
situation where the flow is stabilized. This time interval
covers a few seconds of the real crystal growth experiment
and corresponds to one CPU hour of one processor of a
CRAY X-MP. The results obtained for the macrotasking
version show that the overhead introduced by the additional
library calls is about 4%. With microtasking the
overhead is reduced to 1%. The time measurements for
both versions are presented in Tab. 3.

The sequential (i.e. single-task) reference represents the
wall clock time⁷ needed by the best single task program
version simulating 3.4 seconds of the simulation process.
The next row provides the time measurement for the
multitasking versions when four CPUs are used.

                                     macrotasking   microtasking

sequential reference (one program
version running in dedicated mode)            2747 s

4 CPUs
wall clock time (in seconds)         864 s          822 s

Speedup obtained                     3.18           3.34

Tab. 3. Speedup results on CRAY X-MP/416 with
64 memory banks (COS 1.16 BF 3)

⁶ Time measurements were done by calls to IRTC on a dedicated CRAY X-MP/416. The operating system level was
COS 1.16 BF 3.

⁷ Time the user has to wait for the result on a dedicated machine.

In these experimental results the dominant factor which
limits the speedup achieved in this application is memory
contention, which introduces an additional overhead. Using
two CPUs, up to 98 per cent of the theoretical speedup is
gained; with four CPUs only 90 per cent of the theoretical
speedup is obtained by microtasking.
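The speedup figures of Tab. 3 are plain ratios of wall clock times, which a few lines of Python can reproduce. The Amdahl-type bound in the sketch is an added assumption: the text does not state how the theoretical speedup is defined, and the parallel fraction of 0.99 merely borrows the statement that more than 99% of the CPU time is spent in three subroutines. All function names are hypothetical.

```python
def speedup(t_sequential, t_parallel):
    """Speedup as the ratio of sequential to parallel wall clock time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_cpus):
    """Speedup divided by the number of CPUs used."""
    return speedup(t_sequential, t_parallel) / n_cpus


def amdahl_bound(parallel_fraction, n_cpus):
    """Upper bound on speedup if only parallel_fraction of the work is parallel (assumed model)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cpus)
```

With the measured times, `speedup(2747, 864)` and `speedup(2747, 822)` give the two table entries, and `amdahl_bound(0.99, 4)` shows what four CPUs could reach at best under the stated assumption.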

With both concepts, macrotasking and microtasking, it was
not possible to get a reliable timing profile for each of the
loop nests in an automatic way. This has changed with the
autotasking concept. There, a parallel region can be identified
with one loop nest, and timings can be inserted to
measure this nest. The preprocessor PARANAU ([12])
developed by the author was adapted to automatically
insert timing routines just around the loop nests. Each
loop nest is numbered; initialization and accumulation of
the real time spent in these nests is done in FORTRAN
arrays without further intervention.

For  the  application  program,  about  40  loop  nests  are  en¬ 
countered.  From  the  timing  information,  about  15  loop 
nests  are  considered  to  be  important  (more  than  0. 1  %  of 
the  total  time  spent  in  the  loop).  Fig.  4  shows  a  typical 
loop  nest  (nest  21)  representing  about  28%  of  the  total 
CPU  time  even  for  a  short  simulation  interval. 
Autotasking parallelism is introduced by DO ALL directives, which
specify independent DO loop iterations. These directives are inserted
automatically by fpp, which is part of the cf77 compiling system. The
autotasking concept guarantees explicit synchronization of all active
tasks at the bottom of the DO ALL, so no additional synchronization is
needed. To compare the results with the previous timings shown in
Tab. 3, Fig. 6 shows speedups (footnote 8) using up to four CPUs of the
CRAY Y-MP system, executing an autotasking version of the simulation
program with some minor code changes and additional fpp information
directives. Each bar represents one loop nest; the x-axis gives the
loop nest number and the total time (in seconds) spent within the loop,
and the y-axis gives the speedup obtained for each loop nest as well as
for the whole program. Using 4 CPUs, a speedup of about 3.5 is
obtained; this improved result is based on the new memory conflict
resolution strategies (see [8]). The overhead of about 1.5% for the
parallel version using one CPU, compared with the optimal sequential
version, documents the efficiency of the implementation as far as
autotasking primitives are involved.
Fig. 6. Original Czochralski bulk flow simulation
program using up to four CPUs


As can be seen in Fig. 7, increasing the number of CPUs does not lead
to a linear increase of the speedup. This is especially true for code
regions with small granularity (i.e. two-fold nested loops No. 20 and
No. 39), where the communication overhead exceeds the computational
work. Using 5 CPUs, the whole program can be executed very efficiently,
achieving a speedup of about 4.4, but for higher numbers of CPUs the
total number of iterations of the I-loop (25) which have to be
scheduled cannot be distributed in a load-balanced way to 7 or even 8
processors. The granularity of the iterations which have to be
scheduled afterwards is rather large: using 8 CPUs, a quarter of the
real time used within this loop is spent in the last iteration.
Moreover, loop nest No. 17 is a special case of a search


8  Time  measurements  were  done  by  calls  to  the  timing  routine  IRTC  on  a  dedicated  CRAY  Y-MP8/832  running  native 
UNICOS  5.0. 
loop where the outer loop is parallelized. In this case, the search
order is slightly changed, which leads to a superlinear speedup for the
two- and the three-processor versions. On the other hand, more than
four CPUs do not increase the speedup.




Nested parallelism is not supported by the autotasking implementation;
therefore other possibilities have to be considered. As a consequence,
the program was analyzed once again to look for further possible
program transformations. After checking data dependencies, the program
was restructured by hand in the following way:

1) The indices for the outer two loops are stored in the
FORTRAN arrays IJINDEX and IKINDEX once
at the beginning of the program (see Fig. 8).

      DO 100 I=2,LM1
      DO 100 J=2,M
         IJLENGTH=IJLENGTH+1
         IJINDEX(IJLENGTH,1)=I
         IJINDEX(IJLENGTH,2)=J
  100 CONTINUE

      DO 200 I=2,LM1
      DO 200 K=2,NM1
         IKLENGTH=IKLENGTH+1
         IKINDEX(IKLENGTH,1)=I
         IKINDEX(IKLENGTH,2)=K
  200 CONTINUE


Fig.  7.  Original  Czochralski  bulk  flow  simulation 
program  using  up  to  8  CPUs 


Fig. 8. Additional FORTRAN code to store loop indices
in a shared array (executed once)


number    number of        remaining         speedup limited   speedup
of CPUs   iterations       iterations        by Amdahl's       achieved
          for each CPU     (running in       law               for loop 21
                           parallel)

   1          25               0                 1                 1
   2          12               1                 1.92              1.92
   3           8               1                 2.77              2.76
   4           6               1                 3.57              3.54
   5           5               0                 5                 4.51
   6           4               1                 5                 4.88
   7           3               1                 6.25              5.60
   8           3               1                 6.25              5.93

Tab.  4.  Load  balance  analysis  for  original  loop  21 

Tab. 4 summarizes the speedup limits imposed by load balance. The
number of iterations which have to be distributed to the processors is
much too small to lead to satisfying results for larger numbers of CPUs.
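The limit in the fourth column of Tab. 4 follows from simple arithmetic. The sketch below assumes that all 25 iterations cost the same and that the busiest CPU determines the elapsed time; the function name `load_balance_bound` is introduced here for illustration only.

```python
import math


def load_balance_bound(n_iterations, n_cpus):
    """Upper bound on the speedup when n_iterations equal-cost iterations
    are distributed over n_cpus processors."""
    # The busiest CPU executes ceil(n/p) iterations, one after another.
    rounds = math.ceil(n_iterations / n_cpus)
    return n_iterations / rounds


# Bounds for the original loop 21 (25 outer iterations) on 1 to 8 CPUs.
bounds = {p: load_balance_bound(25, p) for p in range(1, 9)}
```

For 7 and 8 CPUs the busiest processor runs four iterations, so the bound is 25/4 in both cases, which is why adding the eighth CPU cannot help this loop. The values agree with the table up to the rounding used there.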


2) All three-fold nested loops similar to loop 21 are
translated to two-fold nested loops with a higher
number of outer loop iterations (see Fig. 9).


CMIC$ DO ALL SHARED(...) PRIVATE(...)
      DO 20008 IK=1,IKLENGTH
         I=IKINDEX(IK,1)
         K=IKINDEX(IK,2)
         RFTP=RFI(I)*DPHII*SI
         DO 151 J=2,M
            CR(I,J,K)=CR(I,J,K)+SI*DRI*
     >         (DFF(I,J,K)-DFF(I+1,J,K))
            CF(I,J,K)=CF(I,J,K)+RFTP*
     >         (DFF(I,J,K)-DFF(I,J+1,K))
            CZ(I,J,K)=CZ(I,J,K)+SI*DZI*
     >         (DFF(I,J,K)-DFF(I,J,K+1))
  151    CONTINUE
20008 CONTINUE

Fig.  9.  FORTRAN  code  for  the  modified  loop  nest 
No.  21 
3) The search loop (loop 17, see Fig. 10) is treated similarly
to loop 21; in this case the collected indices
of the J- and the K-loops were used. In addition, the
compiler directive CDIR$ SUPPRESS FOUND is
used to ensure that the shared variable FOUND is not
held in a register but is instead stored to memory
immediately.

CMIC$ PARALLEL SHARED(...) PRIVATE(...)
CMIC$ DO PARALLEL CHUNKSIZE(NPROC)
      DO 200 IJ=1,IJLENGTH
         I=IJINDEX(IJ,1)
         J=IJINDEX(IJ,2)
CDIR$ SUPPRESS FOUND
         IF(FOUND)GOTO 200
         DO 95 K=2,NM1
   95    IF(ABS(DFF(I,J,K)).GE.EPS)GOTO 99
         GOTO 200
   99    CONTINUE
         FOUND=.TRUE.
CDIR$ SUPPRESS FOUND
CMIC$ SOFT EXIT
         GOTO 210
  200 CONTINUE
  210 CONTINUE
CMIC$ END PARALLEL

Fig. 10. Modified search loop, implemented with
autotasking

With these code modifications there is a much higher level of
parallelism (825 and 3003 loop iterations, respectively, instead of
25). Each of these iterations can be executed in parallel without
changing any data dependencies.
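The effect of collapsing two loops into one index list can be illustrated as follows. This is a sketch under stated assumptions: the loop extents `LM1` and `NM1` are small hypothetical values, the helper names are invented for the example, and iterations are assumed to cost the same; only the iteration counts 25, 825 and 3003 come from the text.

```python
import math


def collapse(first_range, second_range):
    """Flattened list of index pairs, in the spirit of the index arrays of Fig. 8."""
    return [(a, b) for a in first_range for b in second_range]


def parallel_efficiency(n_iterations, n_cpus):
    """Fraction of CPU capacity doing useful work when the busiest CPU
    executes ceil(n/p) equal-cost iterations."""
    busiest = math.ceil(n_iterations / n_cpus)
    return n_iterations / (n_cpus * busiest)


# Hypothetical small extents, chosen only to show the flattening.
LM1, NM1 = 6, 5
ik_index = collapse(range(2, LM1 + 1), range(2, NM1 + 1))

# Load-balance efficiency on 8 CPUs for the iteration counts quoted in the text.
efficiency = {n: parallel_efficiency(n, 8) for n in (25, 825, 3003)}
```

With 25 iterations at most 25/32 of the capacity of 8 CPUs can be used; with 825 or 3003 iterations the load-balance loss drops below one percent, so other overheads become the limiting factor.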

The modified sequential program runs slightly slower (about 185
MFLOPS); therefore, in Fig. 11 the timing results of the modified
program are compared to the best sequential version running on one
processor. Using 8 CPUs, a speedup of about 6.1 is achievable, leading
to a total speed of about 1,200 MFLOPS on a CRAY Y-MP multiprocessor
system.


6.  Multiprogramming  and  Parallelism 

Today  most  of  the  CRAY  multiprocessor  systems  are  still 
used  within  a  multiprogramming  environment,  where  the 
individual  processors  execute  different  jobs  which  are 
totally  independent  of  each  other.  All  programs  compete 
for  the  available  resources.  With  multitasking,  they  may 


Fig. 11. Modified Czochralski bulk flow compared
to the best sequential version

also execute different parts of one program in parallel. The behavior
of programs using multitasking within a multiprogramming environment
and the influence of these programs on the overall work load are of
great interest for rating multitasking concepts with respect to their
efficiency in practice.

The operating systems COS and UNICOS allow the execution of a parallel
program on a dedicated machine or in a multiprogramming batch
environment. A dedicated machine is a computing environment which makes
all its resources (for example all CPUs) available to one program
without any restrictions. All of the results presented above were
obtained in such an environment.

Using one of the multitasking concepts within a multiprogramming batch
environment, each task has to compete for the resources with other
tasks of the same program and with tasks of other programs executed at
the same time. Tasks of a program with higher priority may force
resources to be withdrawn from tasks generated by programs with lower
priority. Usually, the turnaround time of a program is higher if it is
executed within a multiprogramming batch environment than the time it
needs on a dedicated machine.

To evaluate the impact of multitasking on the multiprogramming
environment and to ensure reproducible results, there is a need for a
well-defined and constant system load. This can be achieved by choosing
a benchmark to generate a multiprogramming environment. After fixing
the work load, some test jobs are inserted into the job mix to study
various effects.


6.1. Synthetic Benchmark Programs

Benchmark programs should reflect real program behavior. The selection
of such benchmark programs is an important basis for the performance
evaluation and may also have a strong influence on the results
obtained. For example, in [2,3] a benchmark is described which is
memory bound and therefore leads to a decreased turnaround time when
microtasking is used. Usually, there is no way of studying the same
benchmark with different memory requirements. In general, the flexible
adaptation of such real-life application programs to required benchmark
parameters (e.g. modified run-time, different memory requirements,
changed I/O behavior) is limited because of the strong correlation
between the parameters. This is especially true in a multiprocessor
system, where the load of one processor may influence the performance
of the programs running on the other processors (e.g. through bank
conflicts or I/O blocking). On the other hand, there is a strong
requirement for such flexible modifications of the work load, where
each single benchmark program should reflect a certain behavior.
Therefore, a system was developed that supplies synthetic benchmark
programs in order to simulate any given work load.

On shared-memory multiprocessor systems like the CRAY X-MP and Y-MP,
there are only a few program characteristics which influence the
activities of the other processors. For example, if the memory
requirement of one job is the total memory installed at a site, then
this may cause the other CPUs to execute idle loops. The required
memory size, the memory activities, and the I/O traffic fall into this
category of interprocessor dependent values. Other values like the
requested CPU time, the job priority, and the MFLOPS rate are benchmark
program parameters which have no direct influence on jobs running on
other processors. Based on these considerations, the synthetic
benchmark programs are generated as follows:

•  Some of the values mentioned above, i.e. the memory size (MSIZ),
the job priority (P), and the job time (T), can be assumed to be
static. For each set of such parameters a different version of the
JCL is used.

•  Values like the number of million memory references per second
(MMREFS) or the number of million array elements transferred per
second to an I/O device (MIOS) are highly system context dependent.
For those values dynamic decisions have to be made to guarantee a
certain program behavior.


9 All values are normalized to the interval (0,1).

These dynamic decisions are based on information from the hardware
performance monitor (HPM), which is available on CRAY X-MP and CRAY
Y-MP systems (see [6]). The HPM supplies a set of eight counters which
track certain hardware related events, e.g. the number of executions of
specific instructions, memory activities, MFLOPS etc. With this
information the complete program characteristics can be determined at
each point during execution, and it can be decided whether the
characteristics match the desired program properties. A set of kernels
exists (presently containing about 100 items) with different execution
properties. In the synthetic benchmark program, each call to the HPM
leads to a decision as to which kernel is to be executed next in order
to approximate the described properties. At the end, the benchmark
program behavior is completely composed of the weighted mean of the
properties of all kernels selected during runtime.

The selection of the kernels is done in the following way: each program
kernel is represented by a point p = (MMREFS, MIOS, MFLOPS) in a
three-dimensional space. The actual program performance of a single
benchmark program is a weighted linear combination of the performance
of the program kernels $K_i$, each referencing a point $p_i$ and using
$t_i$ CPU time. At the $j$-th decision time the kernel $K_{i_j}$ is
chosen which minimizes the Euclidean distance (footnote 9) $\|\cdot\|$
from the given reference point $r$, which means:

select $K_{i_j}$ so that $\Delta(i_j) = \min_i \Delta(i)$ with

$$\Delta(i) := \left\| \, r - \frac{\left(\sum_{k=1}^{j-1} t_{i_k}\, p_{i_k}\right) + t_i\, p_i}{\left(\sum_{k=1}^{j-1} t_{i_k}\right) + t_i} \, \right\|$$


This selection procedure enables the simulation of any
three-dimensional point of the convex hull covering all the program
kernels $K_i$. Any point r = (MMREFS, MIOS, MFLOPS) outside of this
hull will be approximated by the point of the convex hull which
minimizes the Euclidean distance to this point.
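The selection rule can be sketched as a greedy loop over a running, time-weighted mean. The code below is an illustration under stated assumptions, not the author's benchmark generator: the four kernels, their CPU times, the target point, and all function names are hypothetical, and the measured HPM values are replaced by the kernels' nominal points.

```python
import math


def distance_after(weighted, spent, kernel, target):
    """Distance between target and the time-weighted mean if `kernel` ran next."""
    point, t = kernel
    mean = [(w + t * p) / (spent + t) for w, p in zip(weighted, point)]
    return math.sqrt(sum((m - r) ** 2 for m, r in zip(mean, target)))


def run_benchmark(kernels, target, n_decisions):
    """Greedily pick, at each decision, the kernel that brings the weighted
    mean of all executed kernels closest to the target point."""
    spent = 0.0                       # accumulated CPU time
    weighted = [0.0] * len(target)    # running sum of t_i * p_i
    protocol = []                     # sequence of selected kernel indices
    for _ in range(n_decisions):
        chosen = min(
            range(len(kernels)),
            key=lambda i: distance_after(weighted, spent, kernels[i], target),
        )
        point, t = kernels[chosen]
        spent += t
        weighted = [w + t * p for w, p in zip(weighted, point)]
        protocol.append(chosen)
    achieved = tuple(w / spent for w in weighted)
    return protocol, achieved


# Hypothetical kernels: (normalized (MMREFS, MIOS, MFLOPS) point, CPU time).
KERNELS = [
    ((0.0, 0.0, 0.0), 1.0),
    ((1.0, 0.0, 0.0), 1.0),
    ((0.0, 1.0, 0.0), 1.0),
    ((0.0, 0.0, 1.0), 1.0),
]
TARGET = (0.4, 0.2, 0.3)  # lies inside the convex hull of the kernels

protocol, achieved = run_benchmark(KERNELS, TARGET, 1000)
```

The list `protocol` plays the role of a record of kernel selections: replaying it executes the same kernel sequence without any further distance evaluations.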

The sequence of kernel selections made at runtime is written into a
file. Further examinations of the same benchmark may be controlled by
this protocol file. This leads to a system load which is exactly the
same, because the same sequence of kernels $K_{i_j}$ is executed
regardless of the actual performance.

Besides  the  protocol  file,  the  synthetic  benchmark  program 
provides  run-time  output  reflecting  the  program  behavior: 

•  Start time and end time of the job (first and last
statement of the benchmark program are timing routines);

•  Required program characteristics (specified by the input
parameters);

•  Obtained program characteristics (based on the information
supplied by the HPM).

In general, the differences between the given program properties and
the benchmark program characteristics obtained are rather small. After
only a few seconds (2 to 5 sec.), the approximated values are very
close to the given target (if the target lies within the convex hull of
the kernels). As an example, Tab. 5 describes a subset of parameters of
the execution history of one benchmark program, named BE001. The table
shows that the values specified for the job and the values actually
obtained by kernel selection are nearly the same. In this example,
about 10,000 kernel selection decisions are made within 60 CPU seconds.
If the kernel selection is done by using the protocol file, the
obtained program performance is about 2 to 7 per cent higher because of
the omitted calls to the HPM.


program characteristics     CPU time   Memory Size   MMREFS   MFLOPS
for BE001                   (T)        (MSIZ)

Specified by the input       60.00      2 MW         110.00    80.00
Obtained by kernel           60.28      2 MW         110.33    80.19
selection

Tab. 5. Program characteristics of benchmark program
BE001

Using synthetic benchmark programs in this way, the only information
needed for simulating a real workload is the set of program
characteristics of the important jobs running at a site. All these data
are available through the accounting data and the HPM. With this
information a copy of the real workload can be executed in an
artificial benchmark environment, and the real time can be scaled down
to a much lower level.


6.2.  The  Benchmark  Generation 

The simulation of a normal work load implies that a mix of such
benchmark programs must be executed. To build such a job mix a
benchmark generator is convenient. The benchmark system generator
described here takes the required program characteristics for all
benchmark programs as input and generates the set of synthetic programs
which dynamically approximate the characteristics for each program by
using the HPM.

Here, a special benchmark is used which consists of 14 different
synthetic programs simulating a given user load. This benchmark was
executed on the CRAY X-MP/416 under the COS operating system. For
simplicity, all work load jobs run with priority 6, and the memory
requirement for each of the benchmark programs is 2 MW (Mega Words).
The other job characteristics are slightly changed from job to job; for
example, the memory activities vary from 50 to 110 MMREFS.

Based on this work load, the influence of additional test jobs on the
total system is examined. As test jobs, a sequential as well as a
parallel version of the numerical simulation program described in
section 5 is inserted into the job mix. Each of the test jobs performs
the same amount of work. About 4 MW of main memory are required to
execute the jobs, and they ran with job priority 7.

6.3.  The  Benchmark  Analysis  Package 

The total benchmark supplies information about the different
influences sequential and parallel programs have on the system
behavior. Given the amount of data describing the system behavior, an
analysis without automatic tools is not practicable. For that reason, a
benchmark analysis package was developed by the author to provide
collected information for all benchmark programs. It provides tables
and graphical charts, because visualization may be helpful in finding
the typical benchmark behavior as well as hot spots.

Fig. 12 shows one of the visualizations created by the benchmark
analysis package. It is a benchmark execution profile where a
sequential job is inserted as a test job (left hand side). For each job
the chart shows the start and end time (in seconds) relative to the
start of the first job (BE001). It also shows how long the jobs stayed
in the machine and how much CPU time was consumed (black color). The
wait time is computed as the difference between turnaround time and CPU
time. In the right corner, the sum of the CPU and the wait time for all
jobs and for the test job is printed. The total time indicates the real
time for the whole benchmark, multiplied by the number of CPUs. Fig. 13
shows the same benchmark, but executing the microtasking version of the
simulation program. Comparing these two charts, it can be seen that
inserting microtasking can lead to both a reduction of turnaround time
and a slightly decreased CPU time for the whole benchmark. On the other
hand, the accumulated wait time is increased, which means that at least
some jobs have to stay longer in the machine.
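The quantities shown in such an execution profile can be derived from per-job start time, end time, and CPU time. The following sketch uses invented job records and a hypothetical `JobRecord` type; it only illustrates the definitions given in the text (wait time as turnaround time minus CPU time, total time as real time multiplied by the number of CPUs).

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    name: str
    start: float      # seconds, relative to the start of the first job
    end: float        # seconds, relative to the start of the first job
    cpu_time: float   # CPU seconds consumed

    @property
    def turnaround(self):
        return self.end - self.start

    @property
    def wait_time(self):
        # Wait time is the difference between turnaround time and CPU time.
        return self.turnaround - self.cpu_time


def summarize(jobs, n_cpus):
    """Accumulated CPU and wait time, plus real time multiplied by the CPU count."""
    real_time = max(j.end for j in jobs) - min(j.start for j in jobs)
    return {
        "cpu": sum(j.cpu_time for j in jobs),
        "wait": sum(j.wait_time for j in jobs),
        "total": real_time * n_cpus,
    }


# Invented example records; they are not measurements from the paper.
jobs = [
    JobRecord("BE001", 0.0, 100.0, 60.0),
    JobRecord("BE002", 10.0, 150.0, 80.0),
]
summary = summarize(jobs, n_cpus=4)
```

Comparing two such summaries, one per benchmark run, shows directly whether a test job reduced the accumulated CPU time at the cost of a larger accumulated wait time.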

All the information used above is provided by the HPM and timing
routines for each program. But there is also a need to look at the
system activities in order to rate multitasking concepts with respect
to the interests of a computer center. To measure system activities,
the system performance monitor (SPM) can be used (see [5]). It provides
several blocks of information concerning all important system
activities. Fig. 14 shows a sketch of the information structure
available from the SPM.




Fig. 14. Sketch of the information provided by the
benchmark analysis package, based on system
performance monitor (SPM) data
On the first level, system-wide accounting data such as user time and
wait-on-semaphore time, but also, for example, the time required by
operating system services, are collected. Each of these services is
provided by a system task; for each of the system tasks the total
execution time and the number of executions can be displayed on the
next level. For special system tasks like the exchange processor or the
job scheduler, further information and statistics are supplied on the
next level. The presentation capabilities of the benchmark analysis
package are shown using the example of the exchange package (EXP)
requests. Either the total information for all 96 EXP requests is given
in one bar chart, or, as in Fig. 15, selected values are collected in a
table.


EXP requests                                   number of calls

Get current date (2)                                  18
Enter message in logfile (4)                       1,132
Open dataset (8)                                     261
Close dataset (11)                                    55
Create/modify local dataset (12)                     440
Get next control statement (14)                      295
Permanent dataset management request (17)            138
Dispose dataset (21)                                  16
Return accumulated CPU-time (23)                     154
Delay job (29)                                        15
Roll job (44)                                          0
Request memory (59)                                   77
Generate new task (63)                                 9
Manipulate HPM (71)                                   90

Fig.  15.  Tool  output  for  EXP  requests 

With  this  tool,  the  impact  of  inserted  test  jobs  can  be  easily 
visualized.  For  the  benchmark  described  in  section  6.2,  the 
total  system  activities  measured  by  the  SPM  are  listed  in 
Tab.  6. 


                                    Inserted Test Jobs
Time (seconds)              Sequential     Microtasking   Microtasking
                            (priority 7)   (priority 7)   (priority 6)

User                          670.582        663.073        667.498
Wait-on-Semaphore               0.000          6.414          0.358
I/O                            98.561        101.944         99.292
Operating system kernel         2.150          2.175          4.077
System wait                     0.122          0.136          0.127
Operating system services       6.932          7.117          7.645

Tab. 6. Time distribution from the system performance
monitor. Benchmark programs are running with
priority 6; system load was kept fixed

The amount of user time is nearly the same as the time accumulated by
the single jobs. Parameters which cannot be measured on a per-job
basis, such as operating system activities, can be determined by means
of this tool.

As a first result of the investigations, the folklore of strongly
increasing operating system overhead as a consequence of the usage of
multitasking seems to be disproved. The additional system activities
are only slightly raised, and the wait-on-semaphore cycles can be
decreased to nearly zero if the priority of the multitasking program is
the same as or lower than that of the other benchmark jobs. These
results seem to be typical for the fine-grain multitasking concepts;
for macrotasking jobs at least slightly increased CPU times are
observed. In future, this benchmark environment can be used for further
intensive examinations, and it will also be implemented under the
UNICOS operating system on the CRAY Y-MP to study the efficiency of
multitasking concepts in a UNIX multiprogramming job mix.


7.  Conclusion 

The functionality of the multitasking concepts varies widely, from the
parallel execution of subroutines with macrotasking, through the
introduction of parallelism with directives in microtasking, to the
automatic parallelization of autotasking. For macrotasking, better
tools for the detection of parallelism and the debugging of programs,
working on an interprocedural basis, are urgently needed. The new
autotasking concept is a superset of the earlier microtasking
implementation: all microtasking primitives are translatable to new
autotasking directives, although both concepts can coexist within one
program system. Therefore, autotasking will become the more important
fine-grain concept. The compiling system cf77, which implements the
autotasking concept, provides remarkable capabilities in the field of
automatic parallelization. The analysis features should still be
enhanced, and user intervention remains necessary in complex cases.

In future, the influence of multitasking concepts in a multiprogramming
environment should be studied more intensively, because the efficient
support of such concepts by the operating system is one requirement for
the acceptance of multitasking for production codes in computer center
environments.
Abstract

In this exploratory paper we discuss the Daubechies wavelet
solution of boundary value problems and initial boundary value
problems for ordinary and partial differential equations in one space
dimension. The theoretical and numerical results suggest that for
the above class of problems wavelets provide a robust and accurate
alternative to more traditional methods such as finite differences
and finite elements. The one-dimensional analysis done in this paper
can also be seen as a necessary step toward the solution of
multidimensional problems, where various technical issues remain to
be resolved.
Introduction


This  paper  describes  selected  developments  in  wavelet-based  numerical  methods  for 
solving  partial  differential  equations  (PDE's).  The  scope  of  this  paper  is  limited  to  a  small 
but  representative  set  of  equations  in  one  space  variable.  Methods  for  solving  problems  in 
two  and  three  dimensions  are  currently  being  developed. 

The  wavelet-based  methods  adapt  a  Ritz-Galerkin  technique  that  uses  functions 
associated  with  the  orthonormal  bases  of  compactly  supported  wavelets  constructed  in 
[Daubechies,  1988].  As  in  the  case  of  the  earlier  orthonormal  wavelet  basis  constructed  in 
[Meyer,  1985],  Daubechies'  bases  are  unconditional  bases  for  the  Sobolev  spaces  and 
therefore  provide  accurate  approximations  of  PDE's  solutions.  Furthermore,  the 
'multiresolution  analysis'  properties  of  these  bases,  described  in  [Mallat,  September  1987] 
and  [Meyer,  1986],  are  ideally  suited  for  multigrid  methods  and  adaptive  grid  refinement 
methods.  Finally,  the  compact  support  of  the  basis  functions  makes  the  'wavelet  transform' 
algorithm,  described  in  [Mallat,  May  1987],  extremely  efficient  for  computing  numerical 
solutions  of  PDE  problems. 

Numerical  solution  of  a  PDE  problem  requires  a  discretization  method  that  reduces  the 
problem  to  finding  the  solution  of  an  algebraic  equation  (AE)  in  a  finite  number  of 
unknowns.  The  latter  problem  is  amenable  to  digital  computation.  PDE’s  involve  functions 
that  model  spatially  distributed,  and  possibly  time  varying,  physical  quantities  such  as 
temperature,  velocity,  or  displacement,  and  differential  operators  that  model  the  physical 
processes  which  determine  the  static  or  dynamic  behavior  of  these  quantities.  Discretization 
methods,  such  as  finite  differences,  finite  elements,  and  spectral  methods,  represent  the 
solution  function  u  by  an  approximation  v  defined  by  N  discrete  parameters.  Then  the 
differential  operator  and  the  constraints  such  as  initial  values  and  boundary  conditions  are 
approximated  by  algebraic  operations  involving  these  parameters.  This  results  in  an  AE 
whose  exact  solution  determines  v.  A  discretization  method  is  effective  if  the  truncation 
error  u-v  tends  to  0  'rapidly'  as  N  increases. 

Numerical solution of a PDE problem also requires a method for solving the AE that
results from discretization. It is only required to obtain an approximate solution w
of the AE such that the algebraic error v-w is comparable to the truncation error u-v. Since
the total error e = u-w satisfies e = truncation error + algebraic error, if v-w ≈ u-v then
additional computation is more effectively utilized by increasing the number of
discretization parameters to decrease the truncation error. Algebraic solution methods
consist of direct methods, such as LU factorization, which yield a solution accurate to
within finite word length limitations, and iterative methods, such as conjugate gradients and
multigrid, which yield a solution whose accuracy increases with the number of iterations.
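The error bookkeeping in this paragraph can be written out explicitly. The norm $\|\cdot\|$ below is an assumption introduced for the illustration (any norm on the solution space will do); the symbols u, v, w and e are those of the text.

```latex
\begin{align*}
  e &= u - w = \underbrace{(u - v)}_{\text{truncation error}}
             + \underbrace{(v - w)}_{\text{algebraic error}}, \\
  \|e\| &\le \|u - v\| + \|v - w\|, \\
  \|v - w\| \le \|u - v\| \quad &\Longrightarrow \quad \|e\| \le 2\,\|u - v\| .
\end{align*}
```

Once the algebraic error is no larger than the truncation error, solving the AE exactly could improve the bound on the total error by at most a factor of two, whereas increasing N reduces the dominant term itself.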

The  analytical  and  computational  properties  of  scaling  functions  and  wavelet  functions 
provide  powerful  numerical  methods  for  discretizing  PDE's  and  for  solving  the  resulting 
AE's.  Our  paper  is  organized  as  follows. 
Section 1 reviews the construction of the scaling functions and wavelet functions
described in [Daubechies, 1988] and derives the analytical and computational properties
that address the requirements for solving PDE's. Sobolev approximation properties of the
scaling functions in H^m(R) and of their restrictions to (0,1) in H^m(0,1) are derived as a
consequence of the vanishing moments properties of the associated wavelets. These
properties provide effective wavelet-based discretization methods. The computational
properties of wavelets are derived from the hierarchical structure of the scaling functions.
These properties include algorithms for expanding functions in wavelet bases from their
sampled values and for representing differential operators with respect to wavelet bases.

Section 2 addresses general results concerning the existence, uniqueness, and
approximation of the solutions of linear variational problems in Hilbert spaces. Examples
concerning elliptic boundary value problems in one space dimension are used to illustrate
some of the above notions.

Section  3  discusses  wavelet-based  solutions  of  second  order  linear  elliptic  PDE's  with 
Neumann  and  Dirichlet  boundary  conditions.  A  variational  formulation  of  the  Dirichlet 
problem  is  used  to  reduce  it  to  a  small  number  of  Neumann  problems  involving  Lagrange 
multipliers.  A  Galerkin  basis  consisting  of  translates  of  a  dilated  scaling  function, 
truncated  so  as  to  have  support  in  the  interval  (0,1),  is  used  to  discretize  the  Neumann 
problem.  The  resulting  AE's  are  solved  using  LU  and  Choleski  factorization  methods. 
Numerical  results  are  presented  to  illustrate  the  effectiveness  of  the  new  methodology. 

Section  4  discusses  the  wavelet-based  solution  of  singularly  perturbed  second  order 
linear  elliptic  boundary  value  problems.  Solutions  of  these  problems  may  exhibit  boundary 
layers  requiring  higher  resolutions  where  the  strong  gradients  exist.  A  domain 
decomposition  method  is  used  to  compute  and  match  wavelet  based  solutions  in  the  low 
and  high  gradient  regions. 

Section  5  discusses  multigrid  methods  for  solving  the  second  order  linear  elliptic 
problems  with  periodic  boundary  conditions. 

Section 6 discusses wavelet-based solutions of linear and nonlinear parabolic equations
in one space dimension. Here space approximations using wavelets are combined with time
stepping methods to solve the one dimensional heat equation and the regularized Burgers
equation. Furthermore, in the case of the Burgers equation, we discuss the use of a basis
consisting of wavelets together with scaling functions in order to filter spurious oscillations
that develop in the shock region.

Section 7 discusses the wavelet-based solution of the linear advection equation
$\partial u/\partial t + \partial u/\partial x = 0$. As in Section 6, a wavelet approximation for the space variable is combined with
time stepping methods.

Finally,  Section  8  provides  further  comments  together  with  our  conclusions  on  these 
preliminary  experiments  with  wavelets  as  tools  for  solving  differential  equations. 
1 Description and Basic Properties of the
Daubechies Scaling Functions and Wavelet
Functions

1.1  Description  of  Daubechies'  Functions. 

The  Daubechies  scaling  functions  and  wavelet  functions  constructed  in  [Daubechies,  1988]  are 
considerably  more  complex  and  elusive  than  the  more  familiar  elementary  functions.  Therefore,  it  is 
convenient  to  consider  these  functions  as  generalizations  of  a  simpler  set  of  functions  described  below. 

Define functions $\phi$ and $\psi$ by:

(1.1) $\phi(x) = 1$ for $0 \le x < 1$, else $\phi(x) = 0$,

(1.2) $\psi(x) = 1$ for $0 \le x < 1/2$, $\psi(x) = -1$ for $1/2 \le x < 1$, else $\psi(x) = 0$,

and define $\phi_{j,k}(x) = 2^{j/2}\phi(2^j x - k)$ and $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$ for any pair of integers (j,k). The
functions $\{\psi_{j,k} : j \ge 0,\ k = 0,\dots,2^j - 1\}$ are the classical Haar functions, first introduced in [Haar, 1910]
to provide an orthonormal basis for $L^2(0,1)$. The Daubechies scaling functions and wavelet functions
are generalizations of the Haar scaling functions $\{\phi_{j,k}\}$ and the Haar wavelet functions $\{\psi_{j,k}\}$. We will
describe the properties of the Haar functions using subspaces $V_n$ and $W_n$ of $L^2(R)$, n any integer,
defined by:

(1.3) $V_n$ = closure of linear space spanned by $\{\phi_{n,k} : k \text{ an integer}\}$,

(1.4) $W_n$ = closure of linear space spanned by $\{\psi_{n,k} : k \text{ an integer}\}$.

Then the following properties hold:

(1.5) $V_{n+1} \supset V_n$ for every integer n,

(1.6) Closure $(\bigcup_n V_n) = L^2(R)$,

(1.7) $\{\phi_{n,k} : k \text{ an integer}\}$ is an orthonormal basis for $V_n$ for every integer n,

(1.8) $W_n$ is the orthogonal complement of $V_n$ in $V_{n+1}$,

(1.9) $\{\psi_{n,k} : k \text{ an integer}\}$ is an orthonormal basis for $W_n$ for every n,

(1.10) $\phi_{j,k}$ and $\psi_{j,k}$ have compact support for all integers j, k,

(1.11) $\int \phi_{j,k}(x)\,dx = 2^{-j/2}$ and $\int \psi_{j,k}(x)\,dx = 0$ for all integers j and k.

A further consequence of properties (1.5), (1.6), and (1.8) is

(1.12) $L^2(R) = V_n \oplus \bigoplus_{j \ge n} W_j = \bigoplus_{j} W_j$, for every integer n.

Now let $P_n$ denote the orthogonal projection of $L^2(R)$ onto $V_n$ and let $Q_n$ denote the orthogonal
projection of $L^2(R)$ onto $W_n$, so that $P_{n+1} = P_n + Q_n$. Then for any $f \in L^2(R)$, $P_n(f)$ yields a $2^{-n}$
'low resolution' approximation to f, and a 'higher resolution' approximation $P_{n+1}(f)$ may be obtained
by adding a 'high frequency' component $Q_n(f)$.
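
As a rough numerical illustration of the relation $P_{n+1} = P_n + Q_n$ in the Haar case (N = 1), the following Python sketch approximates $P_n(f)$ on (0,1) by cell averages of finely sampled values. The name `haar_projection`, the sample count, and the test function are choices made for this illustration only; the averages stand in for the exact integrals.

```python
import numpy as np

def haar_projection(samples, n):
    """Approximate P_n f on [0,1): average the samples over each of the
    2**n dyadic cells and repeat that average across the cell."""
    cells = samples.reshape(2 ** n, -1)          # one row per dyadic cell
    return np.repeat(cells.mean(axis=1), cells.shape[1])

# Midpoint samples of an arbitrary smooth test function (illustrative choice).
x = (np.arange(1024) + 0.5) / 1024
f = np.sin(2 * np.pi * x) + x ** 2

P3 = haar_projection(f, 3)      # 'low resolution' approximation
P4 = haar_projection(f, 4)      # next higher resolution
Q3 = P4 - P3                    # 'high frequency' component added at level 3

# Discrete L2 inner product of P3 and Q3; zero up to rounding, since
# W_3 is orthogonal to V_3.
inner = np.dot(P3, Q3) / len(x)
```

Projecting `P4` back to level 3 returns `P3`, which mirrors the nesting of $V_3$ in $V_4$.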

For every integer $N \ge 1$ Daubechies constructs a pair of functions $\phi$ and $\psi$ that are the functions in
(1.1),(1.2) for N = 1 and that generalize these functions for N > 1. Her construction involves the
following steps:

Step 1 Construct a finite sequence h(0),...,h(2N-1) satisfying:

(1.13) $\sum_k h(k)h(k+2m) = \delta_{0,m}$, for every integer m,

(1.14) $\sum_k h(k) = \sqrt{2}$,

(1.15) $\sum_k g(k)k^m = 0$, whenever $0 \le m \le N-1$, where $g(k) = (-1)^k h(-k+1)$.

(Note that for m = 0, equation (1.15) is implied by equation (1.13) and equation (1.14).)

Step 2 Construct the trigonometric polynomial $m_0(y)$ by

(1.16) $m_0(y) = (1/\sqrt{2}) \sum_k h(k)\exp(iky)$.

Step 3 Construct the scaling function $\phi$ so its Fourier transform $\hat{\phi}$ satisfies:

(1.17) $\hat{\phi}(y) = (1/\sqrt{2\pi}) \prod_{k \ge 1} m_0(2^{-k}y)$.

Step 4 Construct the wavelet function $\psi$ by

(1.18) $\psi(x) = \sum_k g(k)\phi(2x-k)$.
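
The conditions of Step 1 can be checked numerically for a concrete filter. The sketch below does this in Python for N = 2, using the widely quoted closed form of the four coefficients h(0),...,h(3); that closed form and the helper names are assumptions of this illustration and are not given in the text. The moment condition (1.15) is tested for m = 0,...,N-1.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Assumed closed form of the N = 2 coefficients, scaled so that sum(h) = sqrt(2).
h = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
N = len(h) // 2

def h_at(k):
    """h(k), taken as zero outside 0..2N-1."""
    return h[k] if 0 <= k < len(h) else 0.0

def orthogonality_sum(m):
    """Left side of (1.13): sum_k h(k) h(k+2m)."""
    return sum(h_at(k) * h_at(k + 2 * m) for k in range(len(h)))

def g_at(k):
    """g(k) = (-1)^k h(1-k), nonzero only for k = 2-2N..1."""
    return (-1.0) ** k * h_at(1 - k)

def g_moment(m):
    """Left side of (1.15): sum_k g(k) k^m."""
    return sum(g_at(k) * float(k) ** m for k in range(2 - 2 * N, 2))

# Largest deviation from each of the three conditions.
residual_113 = max(abs(orthogonality_sum(m) - (1.0 if m == 0 else 0.0))
                   for m in range(-N, N + 1))
residual_114 = abs(h.sum() - np.sqrt(2.0))
residual_115 = max(abs(g_moment(m)) for m in range(N))
```

All three residuals should be at rounding level for a valid filter.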

For N > 1 these functions define sets $\{\phi_{n,k}\}$, $\{\psi_{n,k}\}$ and subspaces $V_n$ and $W_n$ of $L^2(R)$
satisfying properties (1.3)-(1.11) and, in addition, the following properties:

(1.19) $\phi_{n,k} = \sum_j h(j-2k)\phi_{n+1,j}$ and $\psi_{n,k} = \sum_j g(j-2k)\phi_{n+1,j}$ for any integer n,

(1.20) support$(\phi_{n,k}) = [2^{-n}k,\ 2^{-n}(k+2N-1)]$, support$(\psi_{n,k}) = [2^{-n}(k+1-N),\ 2^{-n}(k+N)]$,

(1.21) $\int \psi_{j,k}(x)x^m\,dx = 0$ for all integers j and k and any integer $0 \le m \le N-1$,

(1.22) $\phi_{j,k}$ and $\psi_{j,k} \in C^{\lambda(N)}$, the space of Hölder continuous functions with
exponent $\lambda(N)$, where $\lambda(2) = 2 - \log_2(1+\sqrt{3}) = .5500$,
$\lambda(3) = 1.087833$, $\lambda(4) = 1.617926$, and $\lambda(N) \approx 0.3485N$ for large N.

Remark 1.1 Equations (1.19)-(1.21) are derived from the properties of $\phi_{n,k}$ and $\psi_{n,k}$ in [Daubechies,
1988]. Property (1.22) is derived in [Daubechies and Lagarias]. Graphs of the Daubechies scaling
functions and wavelet functions for $2 \le N \le 4$ are illustrated in Figure 1.1.

1.2  Approximation  Properties  of  Daubechies'  Functions. 

For $m \ge 1$ and $\Omega$ an open interval of R (e.g. $\Omega = (0,1)$ or $\Omega = R$) define spaces:

(1.23) $H^0(\Omega) = L^2(\Omega)$ with the standard Hilbert space inner product $\langle .,. \rangle$,

(1.24) $H^m(\Omega) = \{f \in H^{m-1}(\Omega) : f' \in H^{m-1}(\Omega)\}$ with Hilbert space inner product
$(.,.)_m$ defined inductively by $(.,.)_0 = \langle .,. \rangle$ and $(f,g)_m = \langle f,g \rangle + (f',g')_{m-1}$ and
associated norm $\|.\|_m$,

(1.25) $D(\Omega)$ = space of infinitely differentiable functions with compact
support contained in $\Omega$ (i.e. if $\Omega = (0,1)$ these functions 'vanish' at 0 and 1),

(1.26) $H_0^m(\Omega)$ = closure of $D(\Omega)$ in $H^m(\Omega)$ with respect to the norm $\|.\|_m$,

(1.27) for any fixed integer $N \ge 1$ and any integer n, $V_n(\Omega)$ = restriction to $\Omega$
of functions in $V_n$, where $V_n$ is defined as in (1.3) by the scaling function
corresponding to N, i.e. $V_n(R) = V_n$.

For any integer $m \ge 0$, closed interval I, and function $f : I \to R$ we define

(1.28) $E_m(f,I) = \min_{p} \left( \max_{x \in I} |f(x) - p(x)| \right)$,

where p(x) ranges over all polynomials of degree $\le m$. Then

Lemma 1.1 If $N \ge 1$, $f \in D(R)$ with $B = \max(|f^{(N)}(x)|)$, then for any I = [a,b],

(1.29) $E_{N-1}(f,I) \le 2B(b-a)^N/(4^N N!)$.

Proof This is one form of Jackson's theorem, ref. [Dahlquist and Bjorck, 1974].

Lemma 1.2 If f and B are as above and $\psi$ is a wavelet with $N \ge 1$, then

(1.30) $|\langle f,\psi_{j,k} \rangle| \le C\,2^{-j(N+1/2)}$ where $C = 2B(2N-1)^{N+1/2}/(4^N N!)$, for all integers j, k.

Proof Follows from equation (1.21), lemma 1.1, and Schwarz's inequality.

Lemma 1.3 If f and C are as above, let $m \ge 0$, and choose N > m such that the associated scaling
function $\phi$ and wavelet function $\psi$ belong to $H^m(R)$ (the existence of such an integer is implied by
property (1.22)), let n be any integer, let $V_n$ be as in (1.3) and let $P_n$ denote orthogonal projection
onto $V_n$, then

(1.31) $\|f - P_n(f)\|_m \le \left\| \sum_{j \ge n} \sum_k \langle f,\psi_{j,k} \rangle \psi_{j,k} \right\|_m \le D\,2^{-n(N-m)}$,

where $D = CE\|\psi\|_m/(1-2^{-(N-m)})$ and $E^2 = (2N-1) \times$ (length of smallest interval containing support(f)).
Proof Follows from (1.30) and the fact that $\|\psi_{j,k}\|_m \le 2^{jm}\|\psi\|_m$.

Lemma 1.4 The set of restrictions of functions in D(R) to $\Omega$ is dense in $H^m(\Omega)$ for every $m \ge 0$.

Proof This is the classical density theorem which follows from the classical prolongation theorem for
Sobolev spaces, [Adams, 1975], [Brezis, 1983].

We  are  now  prepared  to  derive  the  main  result  of  this  section  that,  together  with  the  general 
approximation  results  in  section  2,  provides  the  mathematical  justification  for  wavelet-based  solution 
methods  for  boundary  value  problems  discussed  in  later  sections. 

Theorem 1.1 Let $m \ge 0$ and choose N, $\phi$, and $\psi$ as in lemma 1.3, let $g \in H^m(\Omega)$, then for any $\varepsilon > 0$
there exists an integer n such that

(1.32) $\|g - h\|_m < \varepsilon$,

where h is the restriction to $\Omega$ of a function in $V_n$.

Proof Follows from lemma 1.3 and lemma 1.4.

Remark  1.2  It  follows  from  property  (1.22)  that  for  m  =  1,  N  =  3  satisfies  the  hypothesis  of  theorem 
1.1.  This  choice  is  adequate  to  treat  second  order  elliptic  boundary  value  problems  in  one  space 
dimension.  For  m  =  2,  N  =  7  satisfies  the  hypothesis  of  theorem  1.1  and  hence  is  adequate  to  treat 
fourth  order  elliptic  boundary  value  problems. 

1.3  Computational  Properties  of  Daubechies'  Functions. 

Computational  properties  are  mathematical  properties  that  are  related  to  algorithm  requirements  and 
algorithm  performance.  This  section  will  discuss  computational  properties  related  to  both  direct 
algorithms  and  iterative  algorithms. 

1.3.1  The  Mallat  Transform. 

Fix an integer $N \ge 1$ and let $\phi$ and $\psi$ be the associated scaling function and wavelet function. For
any integer n let $V_n$ and $W_n$ be defined as in (1.3),(1.4). Then from properties (1.7)-(1.9) it follows
that every $f \in V_{n+1}$ admits the representation

(1.33) $f(x) = \sum_k c(k)\phi_{n+1,k} = \sum_k a(k)\phi_{n,k} + \sum_k b(k)\psi_{n,k}$,

where a, b, and c are square summable sequences. Then by property (1.19) the sequences a and b are
determined from the sequence c by a unitary transformation $T: L^2(Z) \times L^2(Z) \to L^2(Z)$, given by

(1.34) $T(a,b) = c$ where $c(k) = \sum_j h(k-2j)a(j) + \sum_j g(k-2j)b(j)$.

The inverse $T^{-1} : L^2(Z) \to L^2(Z) \times L^2(Z)$ is given by

(1.35) $T^{-1}(c) = (a,b)$ where $a(k) = \sum_j h(j-2k)c(j)$ and $b(k) = \sum_j g(j-2k)c(j)$.

Remark 1.3 The transformation T is called the Mallat transformation and is described in detail in ref.
[Mallat, May 1987]. For any integer n, integer $J \ge 1$, and $f \in L^2(R)$, the functions $Q_{n-1}(f), Q_{n-2}(f), \dots,
Q_{n-J}(f), P_{n-J}(f)$ can be computed by first computing $P_n(f)$ and then applying the Mallat transform J
times. This yields a 'multiresolution' analysis of a function f. The computation of T or $T^{-1}$ requires
only N multiplies and N-1 additions for each 'output value'.
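
The pair (1.34),(1.35) can be sketched in a few lines of Python. The version below is a minimal illustration under stated assumptions: sequences are taken to be periodic with even length M (the text works on all of Z), the N = 2 coefficients are the commonly quoted closed form, and the function names are hypothetical.

```python
import numpy as np

def d4_filter():
    """Assumed closed form of h(0..3) for N = 2, with sum(h) = sqrt(2)."""
    s3 = np.sqrt(3.0)
    return np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))

def wavelet_filter(h):
    """Offsets m = 2-2N..1 and values g(m) = (-1)^m h(1-m)."""
    offsets = list(range(2 - len(h), 2))
    return offsets, [(-1.0) ** m * h[1 - m] for m in offsets]

def mallat_inverse(c, h):
    """(1.35): split c into a(k) = sum_j h(j-2k) c(j), b(k) = sum_j g(j-2k) c(j)."""
    M = len(c)
    offsets, g = wavelet_filter(h)
    a = np.zeros(M // 2)
    b = np.zeros(M // 2)
    for k in range(M // 2):
        for m, hm in enumerate(h):
            a[k] += hm * c[(2 * k + m) % M]      # index j = 2k + m, wrapped
        for m, gm in zip(offsets, g):
            b[k] += gm * c[(2 * k + m) % M]
    return a, b

def mallat_forward(a, b, h):
    """(1.34): c(k) = sum_j h(k-2j) a(j) + sum_j g(k-2j) b(j)."""
    M = 2 * len(a)
    offsets, g = wavelet_filter(h)
    c = np.zeros(M)
    for j in range(len(a)):
        for m, hm in enumerate(h):
            c[(2 * j + m) % M] += hm * a[j]      # index k = 2j + m, wrapped
        for m, gm in zip(offsets, g):
            c[(2 * j + m) % M] += gm * b[j]
    return c

h = d4_filter()
rng = np.random.default_rng(0)
c = rng.standard_normal(16)
a, b = mallat_inverse(c, h)
c_back = mallat_forward(a, b, h)
```

Because T is unitary, `c_back` should reproduce `c` and the squared norms of `a` and `b` should add up to that of `c`; applying `mallat_inverse` repeatedly to `a` gives the J-stage analysis of Remark 1.3.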

1.3.2  Computing  the  Expansion  of  Functions. 

We will now discuss how to compute the expansion of functions with respect to a 'wavelet' basis
from sampled values of the function. This 'expansion' problem is formalized as follows. Fix an
integer $N \ge 1$ and let $\phi$ and $\psi$ be the associated scaling function and wavelet function as in Section
1.3.1. Let n be any integer, let $V_n$ be defined by (1.3), and let $P_n$ denote the orthogonal projection of
$L^2(R)$ onto $V_n$. The 'expansion' problem consists of computing an approximation to $P_n(f)$ where
$f \in L^2(R)$. This is equivalent to the problem of evaluating an approximation to the integrals

(1.36) $c(k) = \int_{2^{-n}k}^{2^{-n}(k+2N-1)} \phi_{n,k}(x)f(x)\,dx$ for every integer k.

This requires an approximate knowledge of f(x) over each dyadic interval $I_{n,k}$ having the form
$I_{n,k} = [2^{-n}k,\ 2^{-n}(k+2N-1)]$.
Our general approach to this approximation problem is described as follows. For each integer k let
$g_k(x)$ be a function (perhaps a distribution) defined on $I_{n,k}$ that approximates (perhaps in the weak
sense) the restriction of f to $I_{n,k}$, then define

(1.37) $d(k) = \int_{2^{-n}k}^{2^{-n}(k+2N-1)} \phi_{n,k}(x)g_k(x)\,dx$ for every integer k,

to approximate c(k) for every integer k, then define an approximation $\tilde{P}_n(f)$ to $P_n(f)$ by

(1.38) $\tilde{P}_n(f) = \sum_k d(k)\phi_{n,k}$.

The resulting error function can be related to the errors |d(k)-c(k)| by applying Hölder type inequalities
to the equation

(1.39) $\tilde{P}_n(f)(x) - P_n(f)(x) = \sum_{2^n x-2N+1 < k < 2^n x} (d(k)-c(k))\phi_{n,k}(x)$.

We describe two specific techniques for computing approximations d(k) to the expansion
coefficients c(k) of $P_n(f)$ from sampled values of f at dyadic rationals $\{f(2^{-n-s}j) : j \in Z\}$ where $s \ge 0$.
Their performance, measured by the computational complexity and by the asymptotic bounds for the
error |d(k) - c(k)| as n and s become large, is discussed under the assumption that f is infinitely
differentiable and that $\phi(x)$ has Hölder exponent $\lambda(N) > 1$ (i.e. $N \ge 3$ and $\phi(x)$ is continuously
differentiable). These assumptions are made in Section 5 to discuss the computational complexity of
wavelet-based multigrid methods for certain classes of problems.

Approximation Technique 1 Choose an integer $s \ge 0$ and let $g_k(x)$ be the measure

(1.40) $g_k(x) = 2^{-n-s} \sum_{j=2^s k}^{2^s(k+2N-1)} f(2^{-n-s}j)\,\delta(x - 2^{-n-s}j)$

to obtain

(1.41) $d(k) = 2^{-s-n/2} \sum_{j=0}^{2^s(2N-1)} \phi(2^{-s}j)\,f(2^{-n}k + 2^{-n-s}j)$.

This requires $\approx 2^s(2N-1)$ operations to compute each d(k) and therefore it requires $\approx 2^{n+s}(2N-1)$
operations to compute the expansion of $P_n(f)$ per unit interval. Under the assumptions on f and $\phi$
above, it can be proved that there exist constants $C_1$ and $C_2$ such that

(1.42) $|d(k) - c(k)| \le C_1 2^{-3n/2-s} + C_2 2^{-3n/2-\kappa s}$ where $\kappa = \min\{1, \lambda(N) - 1\}$.

Remark 1.4 In [Resnikoff, March 1988] explicit expressions for the values of $\phi$ at dyadic rationals as
rational functions of the parameters $\{h(i) : i = 0,\dots,2N-1\}$ that define $\phi$ in equations (1.13)-(1.18) are
derived and the following equation is derived

(1.43) $2^{-s} \sum_{j=0}^{2^s(2N-1)} \phi(2^{-s}j) = 1$.

This important result is required to derive relation (1.42) as well as to compute d(k) in equation (1.41).
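
Equations (1.41) and (1.43) can be tried out numerically. The Python sketch below tabulates $\phi$ at the dyadic rationals $2^{-s}j$ for N = 2 by solving for the integer values and then applying the two-scale relation implied by (1.19) level by level. This numerical recursion is a stand-in chosen for the illustration, not the closed-form rational expressions of [Resnikoff, March 1988]; the N = 2 coefficients and all function names are assumptions.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)
S3 = np.sqrt(3.0)
# Assumed closed form of the N = 2 coefficients, with sum(H) = sqrt(2) as in (1.14).
H = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * SQRT2)

def phi_on_dyadic_grid(h, s):
    """Values phi(j / 2**s) for j = 0..(2N-1)*2**s."""
    L = len(h) - 1                                # support of phi is [0, L]
    # Integer values: eigenvector (eigenvalue 1) of M[i, j] = sqrt(2) h(2i - j).
    M = np.zeros((L + 1, L + 1))
    for i in range(L + 1):
        for j in range(L + 1):
            if 0 <= 2 * i - j <= L:
                M[i, j] = SQRT2 * h[2 * i - j]
    w, v = np.linalg.eig(M)
    phi = np.real(v[:, np.argmin(np.abs(w - 1.0))]).copy()
    phi /= phi.sum()                              # normalise: sum_k phi(k) = 1
    for level in range(s):
        step = 2 ** level                         # current grid spacing is 1/step
        finer = np.zeros(2 * L * step + 1)
        for j in range(len(finer)):               # x = j / (2*step)
            for k in range(L + 1):
                idx = j - k * step                # 2x - k on the current grid
                if 0 <= idx < len(phi):
                    finer[j] += SQRT2 * h[k] * phi[idx]
        phi = finer
    return phi

def technique1_coefficient(f, n, s, k, phi):
    """d(k) of (1.41), with phi tabulated at spacing 2**-s."""
    j = np.arange(len(phi))
    samples = f(2.0 ** -n * k + 2.0 ** (-n - s) * j)
    return 2.0 ** (-s - n / 2) * np.sum(phi * samples)

s = 5
phi = phi_on_dyadic_grid(H, s)
partition_sum = 2.0 ** -s * phi.sum()             # left side of (1.43)
d_const = technique1_coefficient(lambda x: np.ones_like(x), 3, s, 2, phi)
```

For the constant function f = 1 the computed d(k) should agree with $\int \phi_{n,k}\,dx = 2^{-n/2}$ from (1.11), which is exactly what (1.43) guarantees.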

Remark 1.5 Relation (1.42) implies that if $\kappa < 1$, a more efficient algorithm for computing the
expansion $P_m(f)$ is obtained by combining the approximation technique above with the inverse Mallat
transform as follows. First choose an integer n > m and an integer $s \ge 0$ and calculate {d(k)} for the
expansion $P_n(f)$. Then compute n - m stages of the inverse Mallat transform as described in Remark
1.3 to obtain $P_m(f)$. Optimizing the choice of n and s requires specification of the required accuracy
and the value of the constants $C_1$ and $C_2$ in relation (1.42).

Approximation Technique 2 Choose integers $s \ge 0$, $p \ge 1$ and q and choose $g_k(x)$ to be the Lagrange
interpolating polynomial to f(x) at the set of points $\{2^{-n-s}(2^s k + q + j) : 1 \le j \le p\}$. This polynomial
is defined by

(1.44) $g_k(x) = \sum_{j=1}^{p} f(2^{-n-s}(2^s k + q + j))\,L_j(2^{n+s}x - 2^s k - q)$

where

(1.45) $L_j(y) = \prod_{i=1,\ i \ne j}^{p} [(y - i)/(j - i)]$, for j = 1,...,p

to obtain

(1.46) $d(k) = \sum_{j=1}^{p} M(j)\,f(2^{-n-s}(j + 2^s k + q))$

where

(1.47) $M(j) = 2^{-n/2} \int_0^{2N-1} L_j(2^s x - q)\phi(x)\,dx$ for j = 1,...,p.

This requires p operations to compute each d(k) and thus $\approx 2^n p$ operations to compute the expansion
of $P_n(f)$ per unit interval. Under the assumptions on f and $\phi$ above, Newton's general interpolation
formula [Dahlquist and Bjorck, 1974] implies there exists a constant $C_3$ such that

(1.48) $|d(k) - c(k)| \le C_3 2^{-(p+1/2)n}$.

Clearly, for $p \ge 2$, this technique is asymptotically more efficient than approximation technique 1.
Furthermore, the M(j) are linear combinations of the moments of order 0,...,p-1 of $\phi$. The latter can be
expressed as rational functions of the $\{h(i) : i = 0,\dots,2N-1\}$ using the results in [Resnikoff, March
1988].

1.3.3  Representing  Differential  Operators  in  Wavelet  Bases 

Let $m \ge 0$ and let N > m be such that the associated scaling function $\phi \in H^m(R)$. Let $V_n$ be defined
as in equation (1.27) and let $P_n$ denote the projection of V onto $V_n$. Let $H^m(R_p)$ denote the set of
functions in $H^m(R)$ that are periodic of period 1 (intuitively, $R_p$ represents a circle) together with the
inner product

(1.49) $(f,g)_m = \sum_{j=0}^{m} \int_{x=0}^{x=1} f^{(j)}(x)g^{(j)}(x)\,dx$,

and associated norm $\|.\|_m$. Furthermore, let $V_n(R_p)$ denote $V_n \cap H^m(R_p)$. Then clearly

(1.50) $V_n(R_p) = \{f \in H^m(R) : f = \sum_k c(k)\phi_{n,k}$ and $c(k+2^n) = c(k)$ for all k$\}$.

Let m, N, and $\phi$ be as above and let V denote either $H^m(\Omega)$, for some interval $\Omega$ of R, or $H^m(R_p)$.
For any integer n let $V_n$ denote either $V_n(\Omega)$, defined as in equation (1.27), or $V_n(R_p)$, defined as in
equation (1.50). For any operator $A : V \to V^*$ (the dual space of V) and for any integer n, let $A_n : V
\to V^*$ denote the operator defined by

(1.51) $\langle A_n(f),g \rangle = \langle P_n^*(A(P_n(f))),g \rangle = \langle A(P_n(f)),P_n(g) \rangle$, for every f, g $\in$ V,

where $\langle .,. \rangle : V^* \times V \to R$ denotes the canonical pairing and $P_n^*$ denotes the adjoint of $P_n$. Clearly,
the sequence $\{A_n\}$ provides a sequence of approximations to A as n increases. Furthermore, the
operator $A_n$ is defined by the set

(1.52) $\{a(n,i,j) = \langle A_n(b_i),b_j \rangle : \{b_i\}$ form a basis for $V_n\}$.

The techniques in [Resnikoff, March 1988] provide explicit formulae for calculating {a(n,i,j)} where A
is any differential operator whose coefficients are piecewise polynomial functions whose associated
intervals have dyadic rational endpoints. In this case, each a(n,i,j) is a rational function of the scaling
parameters {h(k)} that define $\phi$ in equations (1.13)-(1.17). Furthermore, a(n,i,j) = 0 whenever |i-j| >
2N-1.

For the remainder of this section we will assume that either $V = H^m(R)$ or $V = H^m(R_p)$ and that A
is a differential operator having constant coefficients. Then clearly for any integer n, a(n,i,j) is a
function of n and of j-i which we will denote by a(n,j-i) (arithmetic on indices will be considered
modulo $2^n$ if $V = H^m(R_p)$). Identify $V_n$ with its dual in $L^2(R)$. Then the operator $A_n$, with respect to
either the basis $\{\phi_{n,k} : k \text{ an integer}\}$ of $H^m(R)$ or the basis $\{\phi_{n,k} : 0 \le k < 2^n\}$ of $H^m(R_p)$, is
represented as convolution with the function a(n,.). Therefore, the eigenfunctions of $A_n$ have the form

(1.53) $F_\omega = \sum_k \exp(i\omega k)\phi_{n,k}$

and have corresponding eigenvalues that are expressed in terms of the Fourier series of a(n,.) by

(1.54) $A_n(F_\omega) = \hat{a}(n,\omega)F_\omega$, where $\hat{a}(n,\omega) = \sum_k a(n,k)\exp(i\omega k)$,

where for $V = H^m(R)$, $0 \le \omega < 2\pi$ and k is summed over all integers, and for $V = H^m(R_p)$, $\omega$
assumes the discrete values $\{2\pi k/2^n : 0 \le k < 2^n\}$ and k is summed over $0 \le k < 2^n$. The function
$\hat{a}(n,\omega)$ will be called the spectrum of the operator $A_n$. Clearly, if $V = H^m(R)$ and if $A = d^r/dx^r$ (r-th
derivative operator), then $\hat{a}(n,\omega) = 2^{nr}\hat{a}(0,\omega)$. Figure 1.2 illustrates the spectrum of the operator $A_0$
for $A = d^2/dx^2$ where $\phi$ corresponds to N = 3 and $V = H^1(R)$. In this case a(0,0) = -5.2679, a(0,k) = 0
for k > 5, and a(0,.) is symmetric (therefore $\hat{a}(0,\omega)$ is real). The spectrum of differential operators
will be used extensively in Section 5 and in Section 6.1.
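
The statement that a constant-coefficient operator acts as a convolution, with eigenvectors $\exp(i\omega k)$ and eigenvalues given by (1.54), can be checked on a periodic index set. In the Python sketch below the nine stencil values a(0,k) for the second derivative with N = 3 are recalled from the literature on wavelet connection coefficients; they are an assumption of this illustration and do not appear in the text, which quotes only a(0,0). The function names are hypothetical.

```python
import numpy as np

# ASSUMED stencil a(0,k), k = -4..4, for A = d^2/dx^2 and N = 3 (symmetric in k).
half = [-295 / 56, 356 / 105, -92 / 105, 4 / 35, 3 / 560]
a0 = {k: half[abs(k)] for k in range(-4, 5)}

def convolution_matrix(a, size):
    """Periodic matrix with entries A[i, j] = a(j - i), indices modulo size."""
    A = np.zeros((size, size))
    for i in range(size):
        for k, value in a.items():
            A[i, (i + k) % size] += value
    return A

def spectrum(a, omega):
    """Fourier series sum_k a(k) exp(i*omega*k) of the stencil, as in (1.54)."""
    return sum(value * np.exp(1j * omega * k) for k, value in a.items())

n = 4                                             # level; the ring has 2**n nodes
size = 2 ** n
a_n = {k: 4.0 ** n * v for k, v in a0.items()}    # a(n,k) = 2**(n*r) a(0,k), r = 2
A_n = convolution_matrix(a_n, size)

omegas = 2 * np.pi * np.arange(size) / size       # discrete frequencies 2*pi*l/2**n
residuals = []
for omega in omegas:
    F = np.exp(1j * omega * np.arange(size))      # coefficient sequence exp(i*omega*k)
    residuals.append(np.max(np.abs(A_n @ F - spectrum(a_n, omega) * F)))
max_residual = max(residuals)
```

Two sanity checks on the assumed stencil are that its entries sum to zero and that $\sum_k k^2 a(0,k) = 2$, as expected of a consistent discretization of the second derivative; its central value rounds to the a(0,0) quoted in the text.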

2  Classical  Results  Concerning  the  Approximation  of 
Linear  Variational  Problems 

2.1  Linear  Variational  Problems  in  Hilbert  Spaces. 

All the linear elliptic boundary value problems to be discussed in this paper can be reduced to the
following (variational) formulation:

(2.1) Find $u \in V$ such that $a(u,v) = L(v)$, for every $v \in V$.

In (2.1) V, a(.,.), L are as follows:

(i) V is a Hilbert space (real for simplicity) with scalar product (.,.) and
associated norm $\|.\|$.

(ii) $a: V \times V \to R$ is a bilinear form (possibly nonsymmetric), continuous,
and V-elliptic over $V \times V$; the last property means that there exists
$\alpha > 0$ such that

(2.2) $a(v,v) \ge \alpha\|v\|^2$, for every $v \in V$.

(iii) $L: V \to R$ is linear and continuous.

If properties (i)-(iii) hold, it follows from the Lax-Milgram Theorem that problem (2.1) has a
unique solution.

Remark 2.1 If the bilinear form a(.,.) is symmetric (i.e. a(v,w) = a(w,v) for every v, w in V) then
problem (2.1) is equivalent to the following minimization problem:

(2.3) Find $u \in V$ such that $J(u) \le J(v)$, for every $v \in V$,

where J is defined by

(2.4) $J(v) = (1/2)a(v,v) - L(v)$.

Indeed, equation (2.1) can be seen as the Euler-Lagrange equation associated with problem (2.3).

Remark 2.2 Let $V^*$ denote the dual space of V and let $\langle .,. \rangle : V^* \times V \to R$ denote the canonical duality
pairing. Then it follows from the Riesz Representation Theorem that there exists $A \in \mathrm{Isom}(V,V^*)$,
uniquely defined, such that

(2.5) $a(v,w) = \langle Av,w \rangle$, for every v, w $\in$ V.

Then, for notational convenience, introduce $l \in V^*$ such that

(2.6) $L(v) = \langle l,v \rangle$, for every $v \in V$.

Problem (2.1) is therefore equivalent to the following linear equation

(2.7) $Au = l$.

Remark 2.3 It follows from (2.5) and (2.6) that

(2.8) $|a(v,w)| \le \|A\|\,\|v\|\,\|w\|$, for every v, w $\in$ V,

(2.9) $a(v,v) \ge \|A^{-1}\|^{-1}\|v\|^2$, for every $v \in V$,

where $\|A\|$ and $\|A^{-1}\|$ are the standard operator norms. Indeed, the largest constant $\alpha$ in (2.2) is
precisely $\|A^{-1}\|^{-1}$.

2.2  Examples 

We illustrate the above generalities by discussing below the variational formulation of some
boundary value problems for second order differential equations.

Example 2.1 Consider the following homogeneous Dirichlet
problem on the interval [0,1]:

(2.10) $-u'' + \sigma u = f$ in (0,1),

(2.11) $u(0) = u(1) = 0$,

where in (2.10) $\sigma$ is a positive constant and $f \in L^2(0,1)$ (actually $\sigma$ can even be 'slightly' negative
and f can be less regular than an $L^2$ function). It can be shown that solving problem (2.10),(2.11) in
$H_0^1(0,1) = \{u \in H^1(0,1) : u(0) = u(1) = 0\}$ is equivalent to solving a linear variational problem of the form
(2.1) with:

(i) $V = H_0^1(0,1)$; $(v,w) = \int_0^1 (v'w' + vw)\,dx$, $\|v\| = (v,v)^{1/2}$.

(ii) $a(v,w) = \int_0^1 (v'w' + \sigma vw)\,dx$.

(iii) $L(v) = \int_0^1 fv\,dx$.

It can be shown that the hypotheses of the Lax-Milgram Theorem are satisfied here, implying that the
corresponding problem (2.1) has a unique solution in $H_0^1(0,1)$ which is also the unique solution of
(2.10),(2.11) in the above space. In this case, $V^*$ can be identified with the dual space $H^{-1}(0,1)$ of
$H_0^1(0,1)$ and A with the operator $A: H_0^1(0,1) \to H^{-1}(0,1)$ defined by $A(v) = -v'' + \sigma v$. This
concludes (momentarily) the Dirichlet problem.

Example 2.2 Now consider the following inhomogeneous Neumann problem on the interval [0,1]:

(2.12) $-u'' + \sigma u = f$ in (0,1),

(2.13) $-u'(0) = c$, $u'(1) = d$,

where in (2.12) $\sigma$ is a positive constant and $f \in L^2(0,1)$. It can be shown that solving problem
(2.12),(2.13) in $H^1(0,1)$ is equivalent to solving a linear variational problem of the form (2.1) with
$V = H^1(0,1)$, a(.,.) the same as in example 2.1, and L given by:

(i) $L(v) = \int_0^1 fv\,dx + cv(0) + dv(1)$.

As in example 2.1, it can be shown that the hypotheses of the Lax-Milgram Theorem are satisfied here,
implying that the corresponding problem (2.1) has a unique solution in $H^1(0,1)$ which is also the
unique solution of (2.12),(2.13) in the above space. In this case it is very convenient to identify the
space $(H^1(0,1))^*$ with $H^{-1}(0,1) \times R^2$; then the operator A is defined by

(2.14) $Av = \{-v'' + \sigma v,\ \{(v-v_0)'(1),\ -(v-v_0)'(0)\}\}$,

where $v_0$ is the unique solution in $H_0^1(0,1)$ of the Dirichlet problem

(2.15) $-v_0'' + v_0 = -v'' + \sigma v$,

(2.16) $v_0(0) = v_0(1) = 0$.

Equation (2.15) holds in $H^{-1}(0,1)$ and it is quite obvious that $v_0$ depends linearly and continuously on
v.

2.3.  General  Approximation  Results. 

Concerning now the approximation of problem (2.1),(2.7) we consider a family $\{V_n\}_n$ of closed
subspaces of V; the $V_n$ are finite dimensional in practice. On $V_n$ it is then quite natural to
'approximate' problem (2.1) by

(2.17) Find $u_n \in V_n$ such that $a(u_n,v) = L(v)$ for every $v \in V_n$.

Problem (2.17) obviously has a unique solution from the Lax-Milgram Theorem. It is then a simple
exercise to prove the following approximation property:

(2.18) $\|u_n - u\| \le \|A\|\,\|A^{-1}\|\,\|v - u\|$, for every $v \in V_n$.

If the bilinear form a(.,.) is symmetric then the above inequality can be refined to yield

(2.19) $\|u_n - u\| \le (\|A\|\,\|A^{-1}\|)^{1/2}\,\|v - u\|$, for every $v \in V_n$.

Remark 2.4 Relation (2.18) clearly implies

(2.20) $\|u_n - u\| \le \|A\|\,\|A^{-1}\|\ \mathrm{Inf}\{\|v - u\| : v \in V_n\}$;

and an analogous relation follows from (2.19).

From the results above we have

(2.21) $\lim_{n \to +\infty} \|u_n - u\| = 0$ if $\lim_{n \to +\infty} (\mathrm{Inf}\{\|v-u\| : v \in V_n\}) = 0$.

In the case where $V_1 \subset V_2 \subset \dots \subset V_n \subset \dots$, then (2.21) is automatically satisfied if the closure of
$\bigcup_n V_n$ is equal to V.

In the following sections we shall use the properties of the Daubechies scaling functions and wavelet
functions discussed in Section 1 to construct subspaces $V_n$ of $H^1(\Omega)$ (where $\Omega$ is an open interval of
R) satisfying relation (2.21).

3  Wavelet  Solution  of  Linear  Elliptic  Problems 

In this section we shall discuss the numerical solution of linear elliptic problems in one space
dimension. We shall consider the Neumann problem first since the solution methodology for the
Dirichlet problem that we discuss in this section is essentially based on the solution of a very small
number of Neumann problems with Lagrange multipliers on the right hand side. Using numerical
experiments we shall evaluate the quality of the wavelet solution: indeed the results to be described in
the following sections seem to indicate that wavelets provide accurate approximate solutions for linear
elliptic boundary value problems in one space dimension.

3.1  Solution  of  the  Neumann  Problem 

3.1.1  Formulation  of  the  Neumann  Problem 

The Neumann problem to be discussed in this section can be formulated as follows:

(3.1) $-(\alpha u')' + \beta u + \gamma u' = f$ in (0,1),

(3.2) $-(\alpha u')(0) = c$, $(\alpha u')(1) = d$.

Using the approach described in Section 2.2, Example 2.2, it can be shown that problem (3.1),(3.2)
has the following variational formulation

(3.3) $u \in V$, $a(u,v) = L(v)$, for every $v \in V$,

with

(3.4) $V = H^1(0,1)$,

(3.5) $a(v,w) = \int_0^1 \alpha(x)v'w'\,dx + \int_0^1 \beta(x)vw\,dx + \int_0^1 \gamma(x)v'w\,dx$, for every v, w $\in$ V,

(3.6) $L(v) = \int_0^1 fv\,dx + dv(1) + cv(0)$, for every $v \in V$;

we assume here that $f \in L^1(0,1)$, but indeed L(.) can be any linear continuous functional over
$H^1(0,1)$. Sufficient conditions to apply the Lax-Milgram Theorem to the variational problem (3.3)
consist of:

(3.7) $0 < \alpha_0 \le \alpha(x) \le \alpha_M$ a.e. on (0,1),

(3.8) $0 < \beta_0 \le \beta(x) \le \beta_M$ a.e. on (0,1),

(3.9) $\gamma \in L^2(0,1)$, $\|\gamma\|_{L^2(0,1)} < \min(\alpha_0,\beta_0)$.

3.1.2  Wavelet-Galerkin  Approximation  of  the  Neumann  Problem 

Let N = 3 and let $\phi$ be the corresponding scaling function as defined in Section 1. Let n be any
integer and let $V_n$ be defined as in Section 1. Define $V_n(0,1)$ to be the restriction to (0,1) of all functions in
$V_n$. By Theorem 1.1, every function in $H^1(0,1)$ can be approximated arbitrarily closely by an element
in $V_n(0,1)$ for sufficiently large n, hence

(3.10) Closure$(\bigcup_n V_n(0,1)) = H^1(0,1)$.

Therefore, the family of subspaces $V_n(0,1)$ of $H^1(0,1)$ is well suited to the Galerkin solution of the
Neumann problem (3.1),(3.2),(3.3). The Galerkin formulation of the above Neumann problem for
every integer n is given by

(3.11) $u_n \in V_n(0,1)$, $a(u_n,v) = L(v)$, for all $v \in V_n(0,1)$.

Since (0,1) is bounded, the space $V_n(0,1)$ is finite dimensional and is a closed subspace of $H^1(0,1)$.
This implies problem (3.11) has a unique solution. It follows from (2.21) and (3.10) that

(3.12) $\lim_{n \to +\infty} (u_n - u) = 0$ in $H^1(0,1)$.

3.1.3  Solution  of  the  Approximate  Problem 

Fix  any  integer  n,  the  solution  un  of  the  approximate  problem  (3.11)  can  be  represented  as 
P 

(3.13)  un  =  ^  un,k^n,k-2N+t  ’  where  p  =  2"  +  2N  -  2  and  Uj . up  e  R, 

k=l 


where  the  functions  are  considered  to  be  restricted  to  (0,1).  This  yields  the  following  system  of  linear 
equations  in  p  unknowns 


(3.14) 


I 


k=l 


aW’n,k-2N+l’<t)n,j-2N+l)Un,k  ~ 


for  every  j  =  l,...,p. 


This  can  be  written  in  matrix  notation  as 
(3.15)  AU  =  F 
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with 


(3.16)  A  -  (ay),  ay  -  a(0nj.2N+1,<|»nii.2N+i) 
and 

(3.17)  F  =  (fj),  f,  =  L(4.ni.2N+1) 
and 

(3.18)  U  =  (Un<i). 

In the case where the bilinear form $a(.,.)$ defined by (3.5) is $H^1(0,1)$-elliptic, the above matrix $A$ is
positive definite, implying that problem (3.15) has a unique solution. We also observe that if the
function $\gamma$ vanishes throughout $(0,1)$ then the bilinear form $a(.,.)$ is symmetric, implying that problem
(3.15) can be solved by standard conjugate gradient algorithms or by Cholesky factorization.
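
As a minimal sketch of the Cholesky option mentioned above, the following solves a small symmetric positive definite system of the form $AU = F$. The function name `solve_spd_cholesky` and the 2x2 example matrix are illustrative assumptions and are not taken from the paper; the actual matrix $A$ of (3.16) would have to be assembled from the wavelet basis.

```python
import numpy as np

def solve_spd_cholesky(A, F):
    """Solve A U = F for a symmetric positive definite A via A = L L^T.

    Hypothetical helper for illustration only; np.linalg.cholesky raises
    LinAlgError when A is not positive definite.
    """
    L = np.linalg.cholesky(A)       # lower triangular factor
    y = np.linalg.solve(L, F)       # forward step: L y = F
    return np.linalg.solve(L.T, y)  # backward step: L^T U = y

# Small illustrative system (not from the paper).
A_demo = np.array([[4.0, 1.0], [1.0, 3.0]])
F_demo = np.array([1.0, 2.0])
U_demo = solve_spd_cholesky(A_demo, F_demo)
```

When $\gamma \neq 0$ the matrix is no longer symmetric and a general solver such as `np.linalg.solve` (LU based) would be the natural replacement.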

Remark 3.1 The evaluation of the $a_{ij}$ and $f_i$ in (3.16),(3.17) required to compute the solution of the
approximate problem (3.11) can be performed using standard numerical quadrature methods. In the
special case where $\alpha$, $\beta$, and $\gamma$ are piecewise polynomial, $a_{ij}$ and $f_i$ can be calculated exactly as
rational functions of the scaling parameters $h(k)$, $k = 0,\dots,2N-1$, using the techniques described in
Section 1.3.

Remark 3.2 Figure 3.1 illustrates the first derivative $\phi'$ of $\phi$ corresponding to $N = 3$, evaluated at 321
uniformly distributed points over the interval $[0,5]$ using the standard first difference approximation
$\phi'(x) \approx [\phi(x+h) - \phi(x)]/h$. Since $\phi' \in C^{0.087833}$ it is 'barely' continuous. Therefore, considerable care
must be taken to assure accurate evaluation of the $a_{ij}$ and $f_i$.

Remark 3.3 It follows from the Fredholm Alternative that problem (3.1),(3.2) is well posed if $u = 0$ is
the only solution of problem (3.1),(3.2) when $f = 0$ throughout $(0,1)$ and $c = d = 0$.

3.1.4  Numerical Experiments


We computed wavelet solutions for three instances of problem (3.1),(3.2):

Test Problem 3.1
For this problem

(3.19)  $c = d = 0$,

and the functions $\alpha$, $\beta$, and $\gamma$ are described by

(3.20)  $\alpha(x) = 1 + x$ for $0 < x < .5$,  $\alpha(x) = 10 - x$ for $.5 < x < 1$,

(3.21)  $\beta(x) = 20 - 10x$ for $0 < x < .5$,  $\beta(x) = 1 + x$ for $.5 < x < 1$,

(3.22)  $\gamma(x) = 0$ for $0 < x < 1$.

The right hand side $f$ was chosen such that the solution $u$ of problem (3.1),(3.2) is

(3.23)  $u(x) = x^3/3 - x^2/2 + 1$;

indeed $f \in H^{-1}(0,1)$ and over each interval $(0,.5)$ and $(.5,1)$ it can be represented by a polynomial;
however, at $x = .5$ it has both a 'jump' discontinuity and a Dirac measure component.

To solve this problem we chose $N = 3$ and $n = 3$ in (3.13) to obtain an approximate problem
(3.11),(3.15) involving $p = 12$ unknowns. This linear problem was solved using a Cholesky
factorization technique (permitted here since $\gamma = 0$ over $(0,1)$ implies $a(.,.)$ is symmetric). Figures
3.2(a) and 3.2(b) illustrate the coefficients $\alpha(x)$ and $\beta(x)$ over $(0,1)$; Figure 3.3(a) illustrates the
comparison between the exact solution (dotted graph) and the computed solution (solid graph). Figure
3.3(b) illustrates the variation of the error $e_n = u - u_n$ over $(0,1)$. We observe from Figures 3.3(a) and
(b) that $e_n$ is small over $(0,1)$ and does not exhibit any special behavior at $x = .5$ (where $\alpha$, $\beta$ and $f$
are discontinuous and/or singular).

Test Problem 3.2
For this problem

(3.24)  $c = 1$ and $d = 0$,

and the functions $\alpha$, $\beta$, and $\gamma$ are described by

(3.25)  $\alpha(x) = 1 + x$ for $0 < x < .5$,  $\alpha(x) = 2 + x$ for $.5 < x < 1$,

(3.26)  $\beta(x) = 1 + x^2$ for $0 < x < .5$,  $\beta(x) = 2 + x$ for $.5 < x < 1$,

(3.27)  $\gamma(x) = 1 + x$ for $0 < x < .5$,  $\gamma(x) = 2 - x$ for $.5 < x < 1$.

The right hand side $f$ was chosen such that the solution $u$ of problem (3.1),(3.2) is

(3.28)  $u(x) = x - x^2/2$;

again $f \in H^{-1}(0,1)$ and exhibits the same qualitative behavior at $x = .5$ as in the previous test problem.

To solve this problem we chose $N = 3$ and $n = 4$ in (3.13) to obtain an approximate problem
(3.11),(3.15) involving $p = 20$ unknowns. Since $\gamma \neq 0$ over $(0,1)$ the matrix $A$ in (3.15) is not
symmetric. Therefore, this linear problem was solved using an LU factorization technique. Figures
3.4(a),(b),(c) illustrate the coefficients $\alpha(x)$, $\beta(x)$, and $\gamma(x)$ over $(0,1)$; Figure 3.5(a) illustrates the
comparison between the exact solution (dotted graph) and the computed solution (solid graph). Figure
3.5(b) illustrates the variation of the error $e_n = u - u_n$ over $(0,1)$. We observe from Figures 3.5(a) and
(b) that, as in the previous example, $e_n$ is small over $(0,1)$ and does not exhibit any special behavior at
$x = .5$.


Test Problem 3.3

This test problem concerns the solution of the following one dimensional Neumann problem

(3.29)  $-u'' + u = (1 + \pi^2)\sin \pi x + 1$ on $(0,1)$,

(3.30)  $-u'(0) = u'(1) = -\pi$,

whose exact solution is given by

(3.31)  $u(x) = \sin \pi x + 1$.

We have solved problem (3.29),(3.30) using the wavelet based method described in Sections 3.1, 3.2,
taking $N = 3$ and $m = 3,4,5,6,7$, the corresponding values of $p$ being then 12, 20, 36, 68, 132. We took
advantage of these various calculations to study the influence of $m$ (and $p$) on the approximation error;
therefore in Figures 3.6(a) to 3.6(e) we have plotted the variation of $e_n = u - u_n$ on $(0,1)$. In Figure 3.7
we have shown on a log-log scale the variation of $\|e_n\|$ as a function of $p$ ($= p(n)$), showing that the
above error varies approximately like $p^{-2.5}$. This behavior suggests that the accuracy of the above
approximation lies between that of piecewise linear and piecewise quadratic approximation.
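
The values of $p$ quoted above follow directly from $p = 2^n + 2N - 2$ in (3.13), and a convergence rate such as $p^{-2.5}$ is typically read off a log-log plot as a least-squares slope. The sketch below illustrates both steps. The helper names are hypothetical, and the error values are synthetic (an exact power law), not the measured errors of Figure 3.7, which are not tabulated in the text.

```python
import math

def num_unknowns(n, N=3):
    """Number of unknowns p = 2^n + 2N - 2 from (3.13)."""
    return 2 ** n + 2 * N - 2

def observed_order(p_values, errors):
    """Least-squares slope of log(error) against log(p)."""
    xs = [math.log(p) for p in p_values]
    ys = [math.log(e) for e in errors]
    xm = sum(xs) / len(xs)
    ym = sum(ys) / len(ys)
    num = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    den = sum((x - xm) ** 2 for x in xs)
    return num / den

p_values = [num_unknowns(n) for n in range(3, 8)]
# Synthetic model errors, assumed here only to exercise the slope estimate.
model_errors = [p ** -2.5 for p in p_values]
slope = observed_order(p_values, model_errors)
```

With measured error norms in place of `model_errors`, the same function would return the empirical exponent.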

3.2  Solution  of  the  Dirichlet  Problem 

3.2.1  Formulation  of  the  Dirichlet  Problem 

The  Dirichlet  problem  to  be  discussed  in  this  section  can  be  formulated  as  follows: 

(3.32)  $-(\alpha u')' + \beta u + \gamma u' = f$,

(3.33)  $u(0) = c$,  $u(1) = d$;

we suppose here that $f \in H^{-1}(0,1)$.

Consider $v \in H_0^1(0,1)$ ($= \{v : v \in H^1(0,1),\ v(0) = v(1) = 0\}$); then multiplying (in the sense of
the duality pairing) both sides of equation (3.32) by $v$ we obtain

(3.34)  $\int_0^1 (\alpha u'v' + \beta uv + \gamma u'v)\,dx = \langle f,v \rangle$, for every $v \in H_0^1(0,1)$.

Suppose that problem (3.32),(3.33) has a solution $u$ in $H^1(0,1)$; then $u$ necessarily satisfies the
following variational condition

(3.35)  $u \in H^1(0,1)$,  $u(0) = c$,  $u(1) = d$,  $a(u,v) = \langle f,v \rangle$ for every $v \in H_0^1(0,1)$.

In (3.35) the bilinear form $a(.,.)$ is defined by (3.5). Actually, (3.35) implies (3.32),(3.33). Suppose
that conditions (3.7)-(3.9) on the functions $\alpha$, $\beta$, $\gamma$ are satisfied; then we can easily prove that the
bilinear form $a(.,.)$ is continuous and $H_0^1(0,1)$-elliptic over $H_0^1(0,1) \times H_0^1(0,1)$. This implies that
any solution of (3.32),(3.33),(3.35) is necessarily unique. To prove the existence of a solution we
shall use again the Lax-Milgram Theorem. To overcome the (small) difficulty associated with the fact that,
unless $c = d = 0$, $u \notin H_0^1(0,1)$, we introduce the functions $u_0$ and $w$ defined by

(3.36)  u0(x)  =  c  +  (d-c)x, 
and 

(3.37)  w  =  u  -  u0. 

The function $w$ is clearly a solution of the following variational problem

(3.38)  $w \in H_0^1(0,1)$,  $a(w,v) = \langle f,v \rangle - a(u_0,v)$ for every $v \in H_0^1(0,1)$.

From the properties of $a(.,.)$ the right hand side of equation (3.38) depends linearly and continuously
on $v \in H_0^1(0,1)$. We can therefore apply the Lax-Milgram Theorem to problem (3.38) to obtain a
unique solution $w$. Combining (3.37),(3.38) we have thus proved the existence of a solution $u$ to
problem (3.32),(3.33),(3.35) (uniqueness was already established).

Remark 3.4 Using the fact that the seminorm defined by $\|v'\|_{L^2(0,1)}$ is a norm
over $H_0^1(0,1)$, equivalent to the standard $H^1(0,1)$-norm, problem (3.32),(3.33),(3.35) is still well
posed if we assume for example that $\beta = 0$ over $(0,1)$ and $\gamma$ = constant over $(0,1)$, relation (3.7) being
still satisfied.

3.2.2  Reduction  of  the  Dirichlet  Problem  to  the  Neumann  Problem 

There  are  various  ways  of  solving  Dirichlet  problems  using  variational  methods.  Two  fairly 
classical  methods  are  the  following: 

(i)  Approximate $H^1(\Omega)$ by a finite dimensional subspace $V_h$, then decompose $V_h$ as follows:

(3.39)  $V_h = V_{0h} \oplus M_h$

where $V_{0h}$ approximates $H_0^1(\Omega)$, then construct a function $u_{0h} \in M_h$ that satisfies (approximately) the
Dirichlet boundary condition. Finally, introducing $w_h = u_h - u_{0h}$, reduce the approximate Dirichlet
problem to a variational problem in $V_{0h}$. This follows the approach discussed in Section 3.2.1 to solve
via (3.36)-(3.38) the Dirichlet problem (3.32),(3.33),(3.35). The construction of $u_{0h}$ in dimension > 2
may be a complicated problem in itself for some types of approximations.

(ii)  Reduce  the  solution  of  the  Dirichlet  problem  to  the  solution  of  Neumann  problems  using  boundary 
multipliers  and/or  penalization  of  the  Dirichlet  condition.  This  approach  is  used  in  some  sense  in 
production  finite  element  codes  for  solving  elliptic  problems. 

In  this  paper  we  will  focus  on  the  second  approach  since  it  provides  a  possible  methodology  for 
solving  multidimensional  elliptic  problems.  However  we  will  also  briefly  comment  on  the  first 
approach  which  seems  to  be  more  technically  involved  and  will  be  discussed  in  more  detail  in  a 
forthcoming  paper. 

In order to describe approach (ii) we consider the following abstract Dirichlet problem in $H^1(0,1)$:

(3.40)  $u \in H^1(0,1)$,  $u(0) = c$,  $u(1) = d$,  $a(u,v) = \langle f,v \rangle$ for every $v \in H_0^1(0,1)$,

where $f \in H^{-1}(0,1)$ and where the bilinear form $a(.,.)$ is continuous and elliptic over $H^1(0,1)$ (the case
where $a(.,.)$ is elliptic over $H_0^1(0,1)$ but not over $H^1(0,1)$ will be addressed later); then by the Lax-Milgram
Theorem, problem (3.40) is well posed. Denote by $L: H^1(0,1) \to R$ a linear continuous
functional satisfying

(3.41)  $L(v) = \langle f,v \rangle$ for every $v \in H_0^1(0,1)$;

such a functional always exists by the Riesz Representation Theorem. To the Dirichlet problem (3.40)
we associate the following problem:

(3.42)  $(u,\lambda) \in H^1(0,1) \times R^2$;  $\lambda = (\lambda_1,\lambda_2)$,

(3.43)  $a(u,v) = L(v) + \lambda_1 v(0) + \lambda_2 v(1)$, for every $v \in H^1(0,1)$,

(3.44)  $u(0) = c$,  $u(1) = d$.

Suppose that problem (3.42)-(3.44) has a solution $(u,\lambda)$; taking $v \in H_0^1(0,1)$ in (3.43) it follows
from (3.41) that $u$ is also the solution of (3.40). We shall now show that problem (3.42)-(3.44) can be
reduced to the solution of 3 Neumann problems and to that of a 2x2 well posed linear system. Define
$u_0$, $u_1$, $u_2$ as the solutions of the following variational problems in $H^1(0,1)$:

(3.45)  $u_0 \in H^1(0,1)$,  $a(u_0,v) = L(v)$, for all $v \in H^1(0,1)$,

(3.46)  $u_1 \in H^1(0,1)$,  $a(u_1,v) = v(0)$, for all $v \in H^1(0,1)$,

(3.47)  $u_2 \in H^1(0,1)$,  $a(u_2,v) = v(1)$, for all $v \in H^1(0,1)$.

The function $u$ in (3.42)-(3.44) necessarily satisfies

(3.48)  $u = u_0 + \lambda_1 u_1 + \lambda_2 u_2$

implying that $\lambda$ satisfies

(3.49)  $u_1(0)\lambda_1 + u_2(0)\lambda_2 = c - u_0(0)$,
        $u_1(1)\lambda_1 + u_2(1)\lambda_2 = d - u_0(1)$.

If (3.49) has a solution $(\lambda_1,\lambda_2) = \lambda$ then the pair $\{u_0 + \lambda_1 u_1 + \lambda_2 u_2, \lambda\}$ is a solution of (3.41)-
(3.44). To show that (3.49) has a solution we shall verify that the matrix

(3.50)  $A = \begin{pmatrix} u_1(0) & u_2(0) \\ u_1(1) & u_2(1) \end{pmatrix}$

is positive definite; to show this property take $\mu = \{\mu_1,\mu_2\} \in R^2$ and associate to $\mu$ the function $u_\mu$
defined by

(3.51)  $u_\mu = \mu_1 u_1 + \mu_2 u_2$.

The function $u_\mu$ is clearly the unique solution of

(3.52)  $u_\mu \in H^1(0,1)$,  $a(u_\mu,v) = \mu_1 v(0) + \mu_2 v(1)$, for all $v \in H^1(0,1)$.

We also have, from (3.50)-(3.52) and from the ellipticity of $a(.,.)$,

(3.53)  $A\mu \cdot \mu = \mu_1 u_\mu(0) + \mu_2 u_\mu(1) = a(u_\mu,u_\mu) \ge 0$ for all $\mu \in R^2$.

Suppose now that $A\mu \cdot \mu = 0$; then (3.53) and the $H^1$-ellipticity of $a(.,.)$ imply that $u_\mu = 0$ over
$(0,1)$; this fact and (3.52) imply $\mu_1 = \mu_2 = 0$. Matrix $A$, being positive definite, is regular and
therefore $\lambda = (\lambda_1,\lambda_2)$ is uniquely determined from (3.49). In fact we have proved that problems (3.40)
and (3.41)-(3.44) are equivalent.
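
The reduction (3.45)-(3.49) can be sketched in a few lines once a discrete Neumann solver is available. The example below is only an illustration under stated assumptions: piecewise linear finite elements stand in for the wavelet basis of the paper, the bilinear form is taken to be $a(u,v) = \int_0^1 (u'v' + uv)\,dx$ (that is, $\alpha = \beta = 1$, $\gamma = 0$), the load vector uses trapezoidal quadrature, and all function names are hypothetical.

```python
import numpy as np

def neumann_matrix(m):
    """P1 finite element matrix of a(u,v) = int (u'v' + uv) dx on m uniform elements."""
    h = 1.0 / m
    A = np.zeros((m + 1, m + 1))
    ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h      # element stiffness
    me = np.array([[2.0, 1.0], [1.0, 2.0]]) * h / 6.0  # element mass
    for e in range(m):
        A[e:e + 2, e:e + 2] += ke + me
    return A

def dirichlet_via_neumann(f, c, d, m=200):
    """Solve the Dirichlet problem through three Neumann solves and a 2x2 system."""
    x = np.linspace(0.0, 1.0, m + 1)
    h = 1.0 / m
    A = neumann_matrix(m)
    F = h * f(x)                 # trapezoidal approximation of L(v)
    F[0] *= 0.5
    F[-1] *= 0.5
    e0 = np.zeros(m + 1)
    e0[0] = 1.0                  # functional v -> v(0)
    e1 = np.zeros(m + 1)
    e1[-1] = 1.0                 # functional v -> v(1)
    u0 = np.linalg.solve(A, F)   # discrete analogue of (3.45)
    u1 = np.linalg.solve(A, e0)  # discrete analogue of (3.46)
    u2 = np.linalg.solve(A, e1)  # discrete analogue of (3.47)
    M = np.array([[u1[0], u2[0]], [u1[-1], u2[-1]]])             # matrix (3.50)
    lam = np.linalg.solve(M, np.array([c - u0[0], d - u0[-1]]))  # system (3.49)
    return x, u0 + lam[0] * u1 + lam[1] * u2, lam                # combination (3.48)

# Data of the model problem -u'' + u = (1 + pi^2) sin(pi x) + 1, u(0) = u(1) = 1.
rhs = lambda t: (1.0 + np.pi ** 2) * np.sin(np.pi * t) + 1.0
x_demo, u_demo, lam_demo = dirichlet_via_neumann(rhs, 1.0, 1.0)
```

For this data the exact solution is $u(x) = \sin \pi x + 1$, so the multipliers approximate $\lambda_1 = -u'(0) = -\pi$ and $\lambda_2 = u'(1) = -\pi$, which follows from integrating $a(u,v)$ by parts in (3.43).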

Remark 3.5 If the bilinear form $a(.,.)$ is symmetric then problem (3.40) is equivalent to the
minimization problem

(3.54)  $u \in V_{c,d}$,  $J(u) \le J(v)$, for every $v \in V_{c,d}$,

where $V_{c,d} = \{v : v \in H^1(0,1),\ v(0) = c,\ v(1) = d\}$ and $J(v) = (1/2)a(v,v) - L(v)$.

Furthermore, the pair $(u,\lambda)$, which is a solution of (3.41)-(3.44), is a stationary point over
$H^1(0,1) \times R^2$ of the following Lagrangian functional

$\mathcal{L}(v,\mu) = J(v) + \mu_1(v(0) - c) + \mu_2(v(1) - d)$.

The vector $\lambda$ is therefore the Lagrange multiplier associated to the two linear constraints $v(0) - c = 0$,
$v(1) - d = 0$. In the nonsymmetric case we shall still call $\lambda$ a multiplier. The (important) case where the
bilinear form $a(.,.)$ is only $H_0^1(0,1)$-elliptic can be treated by a similar approach. We shall suppose for
simplicity that

(3.55)  $a(v,v) \ge \gamma \|v'\|^2_{L^2(0,1)}$ for all $v \in H^1(0,1)$, with $\gamma > 0$.

In that case, instead of (3.41)-(3.44), we associate to (3.40) the following problem (with $r > 0$):

(3.56)  $(u,\lambda) \in H^1(0,1) \times R^2$;  $\lambda = (\lambda_1,\lambda_2)$,

(3.57)  $a(u,v) + r[u(0)v(0) + u(1)v(1)] = L(v) + \lambda_1 v(0) + \lambda_2 v(1)$, for all $v \in H^1(0,1)$,

(3.58)  $u(0) = c$,  $u(1) = d$.

Define $a_r(.,.)$ by

(3.59)  $a_r(v,w) = a(v,w) + r[v(0)w(0) + v(1)w(1)]$;

it follows then by (3.55) that

(3.60)  $a_r(v,v) \ge \gamma \|v'\|^2_{L^2(0,1)} + r(|v(0)|^2 + |v(1)|^2)$.

Since it can be (easily) shown that $v \to [\|v'\|^2_{L^2(0,1)} + |v(0)|^2 + |v(1)|^2]^{1/2}$ defines over $H^1(0,1)$ a
norm equivalent to the usual $H^1(0,1)$-norm, it follows from (3.60) that if $r > 0$ the bilinear form
$a_r(.,.)$ is $H^1(0,1)$-elliptic. From this property we can easily prove that problems (3.40) and (3.56)-
(3.58) are equivalent; also, the pair $\{u,\lambda\}$ can be obtained through the solution of 3 'Neumann'
problems associated to the bilinear form $a_r(.,.)$ followed by the solution of a 2x2 linear system
associated to a positive definite matrix. The wavelet implementation of this technique is discussed in
Section 3.2.3.

3.2.3  Numerical  Experiments 

We computed wavelet solutions of three instances of problem (3.32),(3.33).

Test Problem 3.4  For this problem $u(0) = 1$ and $u(1) = 5/6$; on the other hand $\alpha$ and $\beta$ are given by
(3.20) and (3.21), respectively, and $\gamma = 0$. The right hand side $f$ has been chosen such that the solution
$u$ of problem (3.32),(3.33) is

(3.61)  $u(x) = x^3/3 - x^2/2 + 1$.

To solve this problem we chose $N = 3$ and $n = 3$ in (3.13) to obtain an approximate problem involving
12 unknowns. The maximum norm error between the exact and computed solutions is $1.2 \times 10^{-3}$,
while the boundary conditions are exactly satisfied, as we can see in Figure 3.8 where both exact and
computed solutions are represented. As one can see, our procedure for treating the Dirichlet condition
via the solution of Neumann problems (see Section 3.2.2) is fairly accurate, particularly in the
neighborhood of the boundary points 0 and 1.

Test Problem 3.5  For this problem we have $u(0) = 0$, $u(1) = .5$ and $\alpha$, $\beta$, and $\gamma$ defined by
(3.25),(3.26) and (3.27), respectively. The right hand side $f$ has been chosen such that the solution $u$
of problem (3.32),(3.33) is

(3.62)  $u(x) = x - x^2/2$.

To solve this problem we chose $N = 3$ and $n = 4$ to obtain an approximate problem involving 20
unknowns. The maximum norm error between the exact and computed solutions is $6.0 \times 10^{-4}$, while
the boundary conditions are exactly satisfied, as we can see in Figure 3.9, where we compare the exact
and computed solutions, and in Figure 3.10, where we have shown the variation of the error on $[0,1]$.

Test Problem 3.6  This test problem concerns the solution of the following one dimensional Dirichlet
problem

(3.63)  $-u'' + u = (1 + \pi^2)\sin \pi x + 1$ on $(0,1)$,

(3.64)  $u(0) = u(1) = 1$,

whose exact solution is given by

(3.65)  $u(x) = \sin \pi x + 1$.

We have solved problem (3.63),(3.64) taking $N = 3$ and $m = 3,5,6$, the corresponding values of $p$
being then 12, 36, 68. We took advantage of these various calculations to study the influence of $m$ (and
$p$) on the approximation error. Figures 3.11(a) to 3.11(c) show the variation of $e_n = u - u_n$ on $(0,1)$
for $m = 3,5,6$ respectively.

4  Solution  of  Singularly  Perturbed  Linear  Elliptic 
Problems 

We consider in this section the wavelet solution of a particular one dimensional Dirichlet problem,
namely

(4.1)  $-\varepsilon u'' - u' = 1$ on $(0,1)$,

(4.2)  $u(0) = u(1) = 0$,

with $\varepsilon > 0$. We shall focus our attention on the cases where $\varepsilon$ is 'small'. The exact solution of problem
(4.1),(4.2) is given by

(4.3)  $u_\varepsilon(x) = (1 - x) - [\exp((1-x)/\varepsilon) - 1] / [\exp(1/\varepsilon) - 1]$;

for small values of $\varepsilon$ it exhibits a boundary layer of thickness of order $\varepsilon$ in the neighborhood of $x = 0$.
The solution $u_\varepsilon(x)$ has been shown in Figure 4.1 for $\varepsilon = 1/710$.
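
Evaluating (4.3) literally in double precision is delicate for $\varepsilon = 1/710$, because $\exp(1/\varepsilon) = \exp(710)$ exceeds the largest representable double. A minimal sketch of a stable evaluation is given below; it multiplies the numerator and denominator of the quotient in (4.3) by $\exp(-1/\varepsilon)$, which is an algebraic identity and not a method described in the paper. The function name `u_exact` is hypothetical.

```python
import numpy as np

def u_exact(x, eps):
    """Exact solution (4.3) of -eps*u'' - u' = 1, u(0) = u(1) = 0.

    Uses [exp((1-x)/eps) - 1] / [exp(1/eps) - 1]
       = [exp(-x/eps) - exp(-1/eps)] / [1 - exp(-1/eps)],
    so only decaying exponentials are evaluated.
    """
    x = np.asarray(x, dtype=float)
    decay = np.exp(-x / eps)   # boundary layer profile near x = 0
    tail = np.exp(-1.0 / eps)  # negligible for small eps, kept for exactness
    return (1.0 - x) - (decay - tail) / (1.0 - tail)

eps_demo = 1.0 / 710.0
grid = np.linspace(0.0, 1.0, 11)
values = u_exact(grid, eps_demo)
```

Away from the layer the solution is close to $1 - x$, and the rewritten formula satisfies both boundary conditions exactly in exact arithmetic.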

Our motivation with this example is to study the ability of wavelet approximations to represent stiff
gradient phenomena such as those occurring in fluid mechanics and semiconductors, for example.

In order to solve problem (4.1),(4.2) using the wavelet techniques discussed in Section 3 we split
the interval $(0,1)$ into $(0,x_c)$ and $(x_c,1)$ with $0 < x_c < 1$. The idea here is to take $x_c$ small and to solve
problem (4.1),(4.2) using a domain decomposition method; the subproblem associated to $(0,x_c)$ will
treat the boundary layer behavior of the solution near $x = 0$ while the subproblem associated to $(x_c,1)$
will describe the smoother component of $u_\varepsilon(x)$. The local solutions are matched at $x_c$ using a multiplier
method which will force the continuity of local solutions and of their first order derivatives at $x = x_c$
(see [Bourgat, Glowinski, LeTallec, Vidrascu, 1989] for further details concerning domain
decomposition techniques in multidimensions containing the one used here as a special case).

Problem (4.1),(4.2) has been solved for $\varepsilon = 1/710$ with $x_c = 1/32$ using first a combination of 20
basis functions in $(0,x_c)$ and 35 in $(x_c,1)$ and then using a combination of 36 and 35 basis functions in
$(0,x_c)$ and $(x_c,1)$ respectively. The corresponding results are shown in Figures 4.2-4.4. Observe
that the wavelet solutions accurately approximate the behavior in the boundary layer despite the fact
that in our calculation $x_c$ was indeed of the order of $\sqrt{\varepsilon}$ instead of $\varepsilon$, reflecting the super-linear
approximation properties of the Daubechies wavelets.

5  Multigrid  Solutions  of  Linear  Elliptic  Problems  with 
Periodic  Boundary  Conditions 

In previous sections we discussed wavelet solutions of problems obtained by directly solving the
system of algebraic equations resulting from discretization. Frequently in practice, these algebraic
systems are so large as to require the use of iterative methods that apply a sequence of relaxation
operations to improve the current estimate $v$. Typical relaxation operations, including Gauss Seidel,
Jacobi, Successive Over Relaxation, and Preconditioned Conjugate Gradients, have a tendency to
dampen high frequency components of the error at a faster rate than the low frequency components.
This phenomenon may result in slow convergence. Multigrid methods, described in [Brandt, 1977],
[Briggs, 1987], [Fedorenko, 1961, 1987], and [McCormick, 1987], address this problem by utilizing
relaxation methods at multiple scales of resolution to balance the dampening across all frequency
components of the error. Multigrid methods can utilize diverse relaxation techniques and various
spatial discretization methods, including finite differences and finite elements.


The multiscale properties of general wavelets, as described in [Mallat, 1987] and
[Meyer, 1985, 1986, 1988], together with the superior approximation and computational properties of
the specific class of Daubechies wavelet bases described in Section 1, suggest that the Daubechies
wavelet bases may provide an effective tool for developing improved multilevel methods. This section
describes preliminary theoretical and numerical results for wavelet based multilevel methods applied to
a simple class of model problems.

5.1  Multilevel  Solutions 

The linear elliptic equation to be discussed is given by

(5.1)  $A u = f$,

where $A$ is a linear strongly elliptic differential operator of order 2 and where $A$, $u$ and $f$ are periodic
with period 1. Let $N \ge 3$, therefore $\phi \in H^1(R)$, and for all $n \ge 0$ let $V = H^1(R_p)$ and $V \supset V_n = V_n(R_p)$
be defined as in Section 1.3.3. Since the operator $A$ is strongly elliptic, it is an isomorphism from $V$
onto $V^* = H^{-1}(R_p)$ and therefore there exists a unique solution to problem (5.1) if $f \in V^*$.

Let $P_n$ denote the projection of $V$ onto $V_n$. Since $V^* \supset V \supset V_n$, $P_n^*$ projects $V^*$ into $V_n^*$. For
every $n \ge 0$ the approximation of problem (5.1) corresponding to the subspace $V_n$ is

(5.2)  $A_n u_n = P_n^*(f)$,

where $A_n = P_n^* A P_n$ and $u_n = P_n(u)$. Clearly, by the Lax-Milgram theorem problem (5.2) has a unique
solution $u_n$ which, by property (2.18), satisfies $\|u_n - u\| \le \|A\|\,\|A^{-1}\|\,\|v - u\|$ for every $v \in V_n$, where
all the norms are with respect to $V = H^1(R_p)$. By inequality (1.31) it follows that $\|u_n - u\|$ is on the
order of $h^{N-1}$ where $h = 2^{-n}$ corresponds to the "step size" of the approximation. Thus the $H^1$ norm of
the error has order $N-1$. If the coefficients of the operator $A$ are "smooth" and if $f \in L^2$ then it may be
shown that $u \in H^2$ and furthermore that the $L^2$ norm of the error $u_n - u$ may be of order as high as $N$.
This conclusion does not hold in general.

Throughout the remainder of this section we assume that $A$ has constant coefficients and that
$f \in L^2$, therefore by the preceding remarks the $L^2$ norm of the truncation error has order $N$. We will
proceed to describe the full multigrid V-cycle method (here we use the terminology in [Briggs, 1987]).
First, we choose integers $n_f > n_0 \ge 1$ which correspond to the finest and the coarsest levels of
approximation to problem (5.1). The approximate problem (5.2) for $n = n_0$ will always
be solved using a direct method. The first step of the algorithm consists of computing the solution $v_{n_0}$ of
problem (5.2) for $n = n_0$. Next, we increment $n \leftarrow n+1$ and choose $v_{n-1}$ ($\in V_n$) as the initial guess
for the solution of problem (5.2) at level $n$ (this requires calculating the expansion of the initial guess in a new basis
of $V_n$, using the Mallat transform $T$ defined by equation (1.34)). Next, apply a V-cycle relaxation
operation to improve the estimated solution $v_n$ of problem (5.2). In this paper we discuss the V-cycle
procedures based on Jacobi relaxation techniques. The Jacobi relaxation operators are defined by the
following affine mappings $J_i : V_n \to V_n$, for $i = 1,2$:

(5.3)  Linear Jacobi Relaxation:  $v_n \leftarrow J_1(v_n) = v_n + c\,r_n$,

(5.4)  Quadratic Jacobi Relaxation:  $v_n \leftarrow J_2(v_n) = v_n + c\,r_n + d\,A_n r_n$,

where in (5.3),(5.4) the residual $r_n$ is defined by

(5.5)  $r_n = P_n^*(f) - A_n v_n$;

here $V_n^*$ is identified with $V_n$ so that $r_n \in V_n$. In (5.3),(5.4) the constants $c$ and $d$ are optimally
chosen as follows:

First define (in the sense of [Briggs, 1987]) the set of the oscillatory eigenvalues of the operator $A_n$
(where as above $V_n^*$ is identified with $V_n$ so that $A_n$ maps $V_n$ to itself) by

(5.6)  $\Lambda = \{a_A(n,\omega) : \pi/2 \le \omega \le 3\pi/2\}$,

as in equation (1.54), and then define $\lambda_1$ and $\lambda_2$ by

(5.7)  $\lambda_1 = \min\{\lambda : \lambda \in \Lambda\}$, and $\lambda_2 = \max\{\lambda : \lambda \in \Lambda\}$.

The constants $c$ and $d$ and their corresponding dampening ratios $D$ (over the set $\Lambda$) for the above
Jacobi relaxation procedures are given by

(5.8)  Linear Jacobi:  $c = 2/(\lambda_1 + \lambda_2)$,  $D = (\lambda_2 - \lambda_1)/(\lambda_2 + \lambda_1)$,

(5.9)  Quadratic Jacobi:  $d = -2/[\lambda_1\lambda_2 + (\lambda_1 + \lambda_2)^2/4]$,  $c = -d(\lambda_1 + \lambda_2)$,  $D = (\lambda_2 - \lambda_1)^2/[(\lambda_1 + \lambda_2)^2 + 4\lambda_1\lambda_2]$.

If $A = -d^2/dx^2$ then $D = 1/3$ for the linear Jacobi relaxation procedure (5.3) applied to the standard
finite difference approximation of problem (5.1); on the other hand we have $D = 2/3$ if (5.3) is applied
to the ($N=3$) wavelet approximation of problem (5.1). Concerning the quadratic Jacobi relaxation
procedure (5.4), it yields $D = .315$ for the ($N=3$) wavelet approximation of problem (5.1); indeed we
obtain an even smaller dampening ratio $D = 1/17$ for the finite difference approximation of problem
(5.1), but the increased complexity of the corresponding algorithm is such that only the linear
procedure (5.3) is used in practice for finite difference approximations (on the other hand procedure
(5.4) is computationally advantageous for ($N=3$) wavelet approximations of (5.1)).
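
The constants in (5.8),(5.9) can be checked numerically. Since $r_n = A_n(u_n - v_n)$, one sweep of (5.4) multiplies the error component with eigenvalue $\lambda$ by $1 - c\lambda - d\lambda^2$ (with $d = 0$ for (5.3)), so $D$ should equal the largest magnitude of this polynomial over $[\lambda_1,\lambda_2]$. The sketch below assumes the well known oscillatory range of the standard three point finite difference Laplacian, $\lambda_1 = 2/h^2$ and $\lambda_2 = 4/h^2$, scaled here to $h = 1$; the function names are hypothetical.

```python
import numpy as np

def linear_jacobi(lam1, lam2):
    """Constants (5.8): relaxation parameter c and dampening ratio D."""
    c = 2.0 / (lam1 + lam2)
    D = (lam2 - lam1) / (lam2 + lam1)
    return c, D

def quadratic_jacobi(lam1, lam2):
    """Constants (5.9): parameters c, d and dampening ratio D."""
    s = lam1 + lam2
    d = -2.0 / (lam1 * lam2 + s * s / 4.0)
    c = -d * s
    D = (lam2 - lam1) ** 2 / (s * s + 4.0 * lam1 * lam2)
    return c, d, D

def max_damping(c, d, lam1, lam2, samples=2001):
    """Largest |1 - c*lam - d*lam^2| sampled over [lam1, lam2]."""
    lam = np.linspace(lam1, lam2, samples)
    return float(np.max(np.abs(1.0 - c * lam - d * lam ** 2)))

# Finite difference oscillatory eigenvalue range, in units of 1/h^2 (assumed).
lam1_fd, lam2_fd = 2.0, 4.0
c_lin, D_lin = linear_jacobi(lam1_fd, lam2_fd)
c_quad, d_quad, D_quad = quadratic_jacobi(lam1_fd, lam2_fd)
```

For this range the formulas reproduce the finite difference ratios $D = 1/3$ and $D = 1/17$ quoted above.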

The first step of the V-cycle Jacobi relaxation consists of applying either (5.3) or (5.4) to the initial
guess $v_n$ so as to dampen the oscillatory components of the error $e_n = u_n - v_n$ to within truncation
error. Under the assumptions on $A$ and $f$ above, the truncation error for wavelet approximations of
problem (5.1) satisfies $\|e_{n+1}\| = 2^{-N}\|e_n\|$ (to be compared to $\|e_{n+1}\| = 2^{-2}\|e_n\|$ for finite difference
approximations), therefore it is necessary to apply approximately $-N/\log_2(D)$ Jacobi iterations to
dampen all of the oscillatory eigenfunction components of the error by a factor of $2^{-N}$. The second step
of the V-cycle Jacobi relaxation procedure consists of dampening the non-oscillatory components of the
error as follows:

(i)  Calculate the projection $P_{n-1}(r_n)$ of the current residual onto $V_{n-1}$ (this requires calculating the
inverse Mallat transform $T^{-1}$ defined by equation (1.35)).

(ii)  Solve the residual equation

(5.10)  $A_{n-1} e_{n-1} = P_{n-1}(r_n)$.

(iii)  Update the estimated solution $v_n$ by

(5.11)  $v_n \leftarrow v_n + e_{n-1}$;

observe that this requires calculating the expansion of $e_{n-1}$ in the basis for $V_n$ using the Mallat
transform $T$ defined by equation (1.34). In practice step (ii) will consist of $\nu_1$ iterations of the linear or
quadratic Jacobi relaxation procedure; also in step (iii) the update (5.11) will be followed by $\nu_2$
iterations of our chosen relaxation procedure. Furthermore, in step (ii) the procedure described by step
(i) is applied recursively to the estimate obtained after Jacobi relaxation in order to dampen the
remaining non-oscillatory error components. This results in a complete V-cycle in the traditional sense
(cf., e.g., [Briggs, 1987]) that starts at level $n > n_0$, then descends to level $n_0$, and finally proceeds
back to level $n+1$. This V-cycle is then repeated until the finest level $n_f$ is reached.
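
The iteration estimate $-N/\log_2(D)$ used in the first step of the V-cycle can be evaluated directly for the dampening ratios quoted earlier. The sketch below is a plain restatement of that formula; the function name `jacobi_sweeps` is hypothetical, and the finite difference case uses order 2 in place of $N$, following the factor $2^{-2}$ stated for finite differences.

```python
import math

def jacobi_sweeps(order, D):
    """Approximate number of sweeps k with D**k = 2**(-order), i.e. -order/log2(D)."""
    return -order / math.log2(D)

# Dampening ratios quoted in the text for N = 3 wavelets and finite differences.
sweeps_wavelet_linear = jacobi_sweeps(3, 2.0 / 3.0)
sweeps_wavelet_quadratic = jacobi_sweeps(3, 0.315)
sweeps_fd_linear = jacobi_sweeps(2, 1.0 / 3.0)
```

In practice these values would be rounded up to integers; they indicate roughly five linear sweeps against about two quadratic sweeps per level for the ($N=3$) wavelet approximation.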

Comparing the total number of arithmetic operations required to achieve an $L^2$ error equal to $\varepsilon$
using various full multigrid V-cycles for solving problem

(5.12)  $-u'' + u = f$ on $(0,1)$,

(5.13)  $u(0) = u(1)$,  $u'(0) = u'(1)$,

we obtain (with $P_1 = \varepsilon^{-1/2}$ and $P_2 = \varepsilon^{-1/N}$)

Finite Difference Linear Jacobi:  #operations = $45\,P_1$

Wavelet Linear Jacobi:  #operations = $12N(4N-2)\,P_2$

Wavelet Quadratic Jacobi:  #operations = $.7 \times 12N(4N-2)\,P_2$

which shows that for wavelet approximations the quadratic Jacobi relaxation procedure is more
advantageous than the linear one, both being superior, for "small" $\varepsilon$, to the linear Jacobi relaxation
procedure applied to the finite difference approximation of problem (5.1) (at least if $N \ge 3$, which is
always the case in practice).
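
To see what "small" $\varepsilon$ means in the comparison above, the three operation counts can be tabulated as functions of $\varepsilon$. The sketch below simply encodes the stated formulas with $P_1 = \varepsilon^{-1/2}$ and $P_2 = \varepsilon^{-1/N}$; the function names are hypothetical.

```python
def fd_linear_ops(eps):
    """Finite difference, linear Jacobi: 45 * eps**(-1/2)."""
    return 45.0 * eps ** -0.5

def wavelet_linear_ops(eps, N=3):
    """Wavelet, linear Jacobi: 12 N (4N - 2) * eps**(-1/N)."""
    return 12.0 * N * (4 * N - 2) * eps ** (-1.0 / N)

def wavelet_quadratic_ops(eps, N=3):
    """Wavelet, quadratic Jacobi: 0.7 times the linear wavelet count."""
    return 0.7 * wavelet_linear_ops(eps, N)

# Tabulate the three counts for a few target accuracies.
for eps in (1e-4, 1e-6, 1e-8):
    print(eps, fd_linear_ops(eps), wavelet_linear_ops(eps), wavelet_quadratic_ops(eps))
```

For $N = 3$ the linear wavelet count equals the finite difference count when $\varepsilon^{-1/6} = 8$, that is near $\varepsilon \approx 3.8 \times 10^{-6}$; below that accuracy target the wavelet counts are the smaller ones.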

5.2  Numerical  Experiments 

In this section the multilevel methods discussed in Section 5.1 have been applied to the practical
solution of problem (5.12),(5.13) for $f$ given by

(5.14)  $f(x) = \sin 2\pi x + \sin 10\pi x$.

For the right hand side (5.14) the exact solution of problem (5.12),(5.13) is given by

(5.15)  $u(x) = \sin 2\pi x/(1 + 4\pi^2) + \sin 10\pi x/(1 + 100\pi^2)$.

The variations of $f$ and $u$ are shown in Figure 5.1.

First, with $h = 1/I$ we have used the following finite difference scheme to approximate problem
(5.12)-(5.14):

(5.16)  $-(u_{i+1} + u_{i-1} - 2u_i)/h^2 + u_i = f(x_i)$,  $1 \le i \le I$;

we force the periodicity of the solution by requiring in (5.16)

(5.17)  $u_0 = u_I$ if $i = 1$, and $u_{I+1} = u_1$ if $i = I$.
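
The scheme (5.16),(5.17) leads to a circulant linear system. As a minimal sketch, the code below assembles that system and solves it with a dense direct solver for the data (5.14), then compares with the exact solution (5.15). The direct solve is an assumption made here for brevity; it is not the six level multilevel method used in the paper, so its error is not expected to match the figure reported for that method. The function names are hypothetical.

```python
import numpy as np

def solve_periodic_fd(f, I=256):
    """Solve -u'' + u = f with period 1 using scheme (5.16) with wrap-around (5.17)."""
    h = 1.0 / I
    x = h * np.arange(1, I + 1)            # grid points x_i = i h, i = 1..I
    A = np.zeros((I, I))
    for i in range(I):
        A[i, i] = 2.0 / h ** 2 + 1.0
        A[i, (i - 1) % I] -= 1.0 / h ** 2  # left neighbour, u_0 = u_I
        A[i, (i + 1) % I] -= 1.0 / h ** 2  # right neighbour, u_{I+1} = u_1
    return x, np.linalg.solve(A, f(x))

def f_rhs(x):
    """Right hand side (5.14)."""
    return np.sin(2 * np.pi * x) + np.sin(10 * np.pi * x)

def u_exact(x):
    """Exact solution (5.15)."""
    return (np.sin(2 * np.pi * x) / (1 + 4 * np.pi ** 2)
            + np.sin(10 * np.pi * x) / (1 + 100 * np.pi ** 2))

x_h, u_h = solve_periodic_fd(f_rhs, I=256)
max_error = float(np.max(np.abs(u_h - u_exact(x_h))))
```

The resulting `max_error` measures only the discretization error of (5.16), since the algebraic system is solved to rounding accuracy.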

The approximate problem (5.16),(5.17) has been solved for $h = 1/256$ using a six level
realization of the general multilevel method described above. For this test problem we employed the
linear Jacobi relaxation procedure (5.3) with $\nu_1 = \nu_2 = 2$. Figure 5.2 shows the computed solution $u_h$
and the variation of the error $u - u_h$ over $[0,1]$; the max norm of the error is $3.3 \times 10^{-4}$.

Next  we  have  considered  the  numerical  solution  of  problem  (5.12)-(5.14)  by  a  combination  of 
(N=3)  wavelet  approximation  and  the  multilevel  methods  of  Section  5.1.  We  have  compared  the 
performance of the linear and quadratic Jacobi relaxation procedures using in both cases $n_f = 6$ and
$n_0 = 3$, resulting in 4 levels; this corresponds to $\dim V_n = 2^n$, $3 \le n \le 6$. In both cases we have used
vn  =  =  4.  Figures  5.3  (linear  Jacobi)  and  5.4  (quadratic  Jacobi)  illustrate  the  computed  solution  u6 

and the corresponding error $u - u_6$. The respective maximum norms of the errors are $9 \times 10^{-6}$ and
$3.6 \times 10^{-6}$. These results indicate the superiority of the methodology combining wavelet approximation
and quadratic Jacobi relaxation procedures.

Experiments  concerning  the  multilevel  wavelet  solution  of  multidimensional  boundary  value 
problems  are  in  progress  and  will  be  reported  in  a  forthcoming  article. 

6  Parabolic  Problems 

This  section  discusses  initial  value  problems  of  the  following  form: 

(6.1)  $\partial u/\partial t + Au = f$ for t > 0,
and

(6.2)  $u(x,0) = u_0(x)$,

where  A  is  a  (possibly  nonlinear)  elliptic  operator  in  the  space  variable  x.  For  simplicity  we  will 
assume  that  A,  u  and  f  are  periodic  in  x  with  period  1.  In  contrast  to  the  ordinary  differential  equations 
treated  in  previous  sections,  a  temporal  as  well  as  a  spatial  discretization  is  required.  As  prototype 
problems of this class we shall focus on the linear heat equation (where $A = -\partial^2/\partial x^2$) and the
regularized Burgers equation (where $Au = -\nu\,\partial^2 u/\partial x^2 + u\,\partial u/\partial x$). Various finite difference time
discretization  schemes  will  be  combined  with  wavelet  space  approximations  to  solve  the  above  two 
problems. 

6.1  Solution  of  Heat  Equation 

6.1.1  Formulation  of  the  Problem 

We  consider  in  this  section  the  numerical  solution  of  the  following  heat  equation 

(6.3)  $\partial u/\partial t - \partial^2 u/\partial x^2 = f$ for t > 0 and $x \in (0,1)$,

(6.4)  $u(0,t) = u(1,t)$, $\partial u(0,t)/\partial x = \partial u(1,t)/\partial x$,

(6.5)  $u(x,0) = u_0(x)$.

Denote the functions x → f(x,t) and x → u(x,t) by f(t) and u(t), respectively. It is assumed that, for
almost every t > 0, $f(t) \in V^* = H^{-1}(R_p)$ and $u(t) \in V = H^1(R_p)$ (where V and its dual space $V^*$ are
defined as in Section 1.3.3). A variational formulation is obtained by multiplying equation (6.3) by
$v \in V$ and integrating by parts with respect to x. This yields

(6.6)  $\langle \partial u/\partial t, v \rangle + \int_0^1 (\partial u/\partial x)(dv/dx)\,dx = \langle f, v \rangle$, for every $v \in V$,

where, in (6.6), $\langle \cdot, \cdot \rangle$ denotes the duality pairing between $V^*$ and V which reduces to the $L^2$ scalar
product when both arguments are in $L^2$.

6.1.2  Wavelet Approximation and Time Discretization of Problem (6.3)-(6.5)

Let N ≥ 3, let n ≥ 1 and let $V_n = V_n(R_p)$ be the subspace of $V = H^1(R_p)$ as defined in Section 1.3.3.
Using relation (6.6) as a guideline, we approximate problem (6.3)-(6.5) by the following problem:
Find a function $u_n(t)$ satisfying for almost every t > 0

(6.7)  $\int_0^1 (\partial u_n(t)/\partial t)\,v\,dx + \int_0^1 (\partial u_n(t)/\partial x)(dv/dx)\,dx = \int_0^1 f_n(t)\,v\,dx$, for every $v \in V_n$,

(6.8)  $u_n(0) = u_{0n}$,

where $u_{0n}$ is the $L^2$ projection $P_n(u_0)$ of the initial data $u_0$ on $V_n$ and where $f_n(t) = P_n^*(f(t))$ is the
projection of f(t) on $V_n^*$ (identified with $V_n$; indeed, $f_n(t)$ is the unique element of $V_n$ satisfying
$\int_0^1 f_n(t)\,v\,dx = \langle f(t), v \rangle$ for all $v \in V_n$).

Problem  (6.7)-(6.8)  is  equivalent  to  a  system  of  first  order  ordinary  differential  equations  obtained 
by  substituting  v  in  (6.7)  with  the  elements  of  a  basis  of  Vn.  This  system  is  equivalent  to  the 
following  initial  value  problem  in  Vn 


(6.9)  $\partial u_n/\partial t + A_n u_n = f_n$ for t > 0,

(6.10)  $u_n(0) = u_{0n}$,

where $A_n = P_n^* A P_n$ is the $V_n$ approximation of $A = -\partial^2/\partial x^2$. Problem (6.9),(6.10) has the following
closed form solution

(6.11)  $u_n(t) = \exp(-tA_n)\,u_{0n} + \int_0^t \exp(-A_n(t - s))\,f_n(s)\,ds.$

For  simplicity  assume  that  f(t)  and  therefore  fn(t)  are  differentiable.  We  use  a  time  discretization  of 
problem  (6.9), (6.10)  that  results  from  approximating  the  Taylor  series  for  un(t+At)  by  the  following 
quadratic  polynomial  in  At 

(6.12)  $u_n(t+\Delta t) = u_n(t) + \Delta t\,[f_n(t) - A_n u_n(t)] + .5\,\Delta t^2\,[\partial f_n(t)/\partial t - A_n(f_n(t) - A_n u_n(t))]$.

This yields the following explicit time stepping scheme:

(6.13)  $u_n^0 = u_{0n}$,

(6.14)  $u_n^{k+1} = u_n^k + \Delta t\,[f_n^k - A_n u_n^k] + .5\,\Delta t^2\,[(\partial f_n/\partial t)^k - A_n(f_n^k - A_n u_n^k)]$,

which is of Lax-Wendroff type. Comparison with the exact solution (6.11) shows this scheme is
second order accurate with respect to $\Delta t$. Furthermore, it can be shown that this scheme is stable if and
only if

(6.15)  $\Delta t \le 2/\lambda_{n,\max}$,

where $\lambda_{n,\max}$ denotes the largest eigenvalue of $A_n$. It can be shown using the same type of analysis
done in Section 1.3.3 that for the (N=3) wavelet and for large n, $\lambda_{n,\max} \approx 14 \cdot 2^{2n}$. Therefore,
stability of the wavelet solution of the heat equation obtained by the explicit time stepping scheme
above requires

(6.16)  $\Delta t \le .143\,h^2$,

where $h = 2^{-n}$ corresponds to the space step size. The stability bound for the corresponding finite
difference (in space) scheme is $.5\,h^2$.
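
The two stability bounds can be checked with a short computation. The sketch below assumes f = 0, so that on an eigenmode of the space operator with eigenvalue $\lambda$ the Lax-Wendroff step multiplies the mode by $g(z) = 1 - z + z^2/2$ with $z = \Delta t\,\lambda$. It takes $\lambda_{\max} = 14/h^2$ for the (N=3) wavelets (the value consistent with the bound $.143\,h^2$) and $\lambda_{\max} = 4/h^2$ for the three-point finite difference operator. Function names are illustrative only.

```python
def amplification(z):
    """Amplification factor of the Lax-Wendroff step with f = 0 on an eigenmode, z = dt * lambda."""
    return 1.0 - z + 0.5 * z * z

def max_stable_dt(lambda_max):
    """Largest dt such that |g(dt * lambda)| <= 1 for every 0 <= lambda <= lambda_max."""
    return 2.0 / lambda_max

n = 6
h = 2.0 ** (-n)
dt_wavelet = max_stable_dt(14.0 / h**2)   # assumed lambda_max for the N = 3 wavelets
dt_fd = max_stable_dt(4.0 / h**2)         # lambda_max of the periodic three-point stencil
print(dt_wavelet / h**2, dt_fd / h**2)
```

Since $g(z) = ((z-1)^2 + 1)/2$, the factor stays in [1/2, 1] exactly when z lies in [0, 2], which is where the condition $\Delta t \le 2/\lambda_{\max}$ comes from.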

Remark 6.1 Scheme (6.13),(6.14) has the inconvenience of using the time derivative of f; also, in its
present form, it is not well suited to the solution of nonlinear problems. Therefore in practice we shall
use the following Runge-Kutta form of the Lax-Wendroff scheme:

(6.17)  $u_n^0 = u_{0n}$,

then for k ≥ 0, $u_n^k$ being known we compute $u_n^{k+1}$ via

(6.18)  $w_n^{k+1/2} = u_n^k + .5\,\Delta t\,(f_n^k - A_n u_n^k)$,

(6.19)  $w_n^{k+1} = u_n^k + \Delta t\,(f_n^k - A_n u_n^k)$,

(6.20)  $u_n^{k+1} = w_n^{k+1/2} + .5\,\Delta t\,(f_n^{k+1} - A_n w_n^{k+1})$.

Schemes (6.13),(6.14) and (6.17)-(6.20) coincide if f = 0 and their stability and accuracy properties
are  the  same,  the  last  scheme  being  more  practical  for  obvious  reasons. 
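
The coincidence of the two schemes for f = 0 can be illustrated numerically. The sketch below is not the wavelet computation: it assumes a periodic three-point finite difference matrix in place of $A_n$, uses $u_0(x) = \sin 2\pi x$ as a hypothetical initial condition, and all function names are ours.

```python
import numpy as np

def periodic_laplacian(M):
    """Three-point approximation of A = -d^2/dx^2 on a periodic grid with M points."""
    h = 1.0 / M
    A = np.zeros((M, M))
    for i in range(M):
        A[i, i] = 2.0 / h**2
        A[i, (i - 1) % M] = -1.0 / h**2
        A[i, (i + 1) % M] = -1.0 / h**2
    return A

def lax_wendroff_step(A, u, dt):
    """Lax-Wendroff step with f = 0: u - dt A u + .5 dt^2 A A u."""
    Au = A @ u
    return u - dt * Au + 0.5 * dt**2 * (A @ Au)

def runge_kutta_step(A, u, dt):
    """Runge-Kutta form with f = 0: two predictor stages followed by a corrector."""
    w_half = u - 0.5 * dt * (A @ u)             # half-step predictor
    w_full = u - dt * (A @ u)                   # full-step predictor
    return w_half - 0.5 * dt * (A @ w_full)     # corrector

M = 32
h = 1.0 / M
dt = 0.4 * h**2                                 # below the finite difference bound .5 h^2
A = periodic_laplacian(M)
x = h * np.arange(M)
u_lw = u_rk = np.sin(2 * np.pi * x)
steps = 100
for _ in range(steps):
    u_lw = lax_wendroff_step(A, u_lw, dt)
    u_rk = runge_kutta_step(A, u_rk, dt)

lam = 4.0 / h**2 * np.sin(np.pi * h) ** 2       # eigenvalue of A for the mode sin(2 pi x)
u_semidiscrete = np.exp(-lam * steps * dt) * np.sin(2 * np.pi * x)
diff_schemes = np.max(np.abs(u_lw - u_rk))
diff_time = np.max(np.abs(u_lw - u_semidiscrete))
print(diff_schemes, diff_time)
```

The first printed number compares the two schemes (they agree up to rounding), the second compares the fully discrete solution with the exact solution of the semidiscrete system.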

Remark  6.2  For  many  applications  the  stability  condition  (6.16)  imposes  a  prohibitively  large  number 
of time steps. For such cases we should choose an implicit time discretization scheme such as the one
used for the Burgers equation in the following Section 6.2. In the case of the heat equation (6.3)-(6.5)
this will lead to the solution at each time step of an elliptic problem for which the methods
described  in  Sections  3  and  5  still  apply. 

For a more complete description and analysis of numerical methods for initial value problems for
differential  operators  one  may  consult,  e.g.,  [Raviart,  Thomas,  1983]  and  [Strikwerda,  1989]  (see 
also  the  references  therein). 

6.1.3  Numerical  Experiments 

The  methodology  described  in  Section  6.1.2  has  been  applied  to  the  solution  of  the  heat  equation 
(6.3)-(6.5) for f = 0 and $u_0(x) = 1/(1 - .5 \sin 2\pi x)$. Figure 6.1 shows the variation, over (0,1), of $u_0$
and  of  the  solution  u  at  time  t  =  .01. 

Figure 6.2 compares, for t = .01, the exact solution of the heat equation (6.3)-(6.5) with the exact
solution $u_h(t)$ of the ordinary differential system obtained from the finite difference space discretization
of (6.3)-(6.5), using h = 1/256. The error here is due solely to the finite difference space discretization.

Figure 6.3 compares, for t = .01, the above semidiscrete solution $u_h(t)$ with a fully discretized
finite difference solution $v_h(t)$ of heat equation (6.3)-(6.5) using the above Lax-Wendroff scheme,
with $\Delta t = 5 \times 10^{-6}$ (approximately the maximum stable step size). The error here is due to the time
discretization alone, since the space discretizations are identical.

Figure 6.4 compares the variation of the functions $u_0$, $P_n(u_0)$ (for the N=4 wavelets), and $u_n(.01)$,
where $u_n(t)$ is the exact solution of the semidiscrete heat equation (6.9),(6.10); this figure has to
be  compared  to  Figure  6.1. 

Figure 6.5 compares, for t = .01, the exact solution u of the heat equation (6.3)-(6.5) with the
exact solution $u_n$ of the semidiscrete heat equation (6.9),(6.10) (obtained from the N=4 wavelet spatial
discretization with n = 5, i.e. h = 1/32).

Figure 6.6 compares, for t = .01, $u_n$ above with the fully discretized N=4 wavelet solution $v_n(t)$ of
heat equation (6.3)-(6.5) using the above Lax-Wendroff scheme, with $\Delta t = 10^{-4}$ (approximately the
maximum stable step size). Because N=4 wavelets provide approximations of accuracy
comparable to finite differences while using fewer degrees of freedom (e.g. 32 instead of 256), the
maximum  stable  step  size  is  significantly  increased.  This  drastically  decreases  the  required 
computation.  However,  since  the  time  steps  are  larger,  a  more  accurate  (than,  e.g.,  first  order  forward 
Euler)  time  stepping  scheme  is  required  to  balance  the  high  spatial  accuracy  provided  by  the  wavelets. 
Comparison  of  Figures  6.5  and  6.6  indicates  that  the  Lax-Wendroff  scheme  provides  this  balance. 

6.2  Solution  of  the  Regularized  Burgers  Equation 

6.2.1  Formulation  of  the  Problem 

In this section we will discuss the numerical solution of the Regularized Burgers Equation defined
with $\nu > 0$ by

(6.21)  $\partial u(x,t)/\partial t - u(x,t)\,\partial u(x,t)/\partial x = \nu\,\partial^2 u(x,t)/\partial x^2$ for t > 0 and 0 < x < 1,

(6.22)  $u(x,0) = u_0(x)$,

completed by appropriate boundary conditions (Neumann, Dirichlet, periodic, ...) at x = 0 and x = 1.
Related problems arise in many branches of science and engineering, particularly fluid mechanics and
petroleum reservoir simulation. Indeed this problem is ideally suited for formulating and evaluating new
numerical methods for eventually solving the Navier-Stokes equations.

6.2.2  Time  Discretization  of  Problem  (6.21), (6.22) 

Let $\Delta t > 0$ be a time discretization step. We can discretize problem (6.21),(6.22) using, for
example, the following semi-implicit scheme:

(6.23)  $u^0 = u_0$,

then for k ≥ 0 we compute $u^{k+1}$ from $u^k$ via

(6.24)  $(u^{k+1} - u^k)/\Delta t - u^k\,\partial u^{k+1}/\partial x = \nu\,\partial^2 u^{k+1}/\partial x^2$ on (0,1),

completed by boundary conditions at x = 0 and x = 1. Scheme (6.23),(6.24) is clearly semi-implicit
and at each time step it provides an elliptic problem similar to those discussed in Sections 3 and 5,
implying, therefore, that it can be easily coupled to the wavelet based space approximation described
there. Numerical experiments indicate that Scheme (6.23),(6.24) has good stability properties (there
exist more sophisticated and accurate schemes; in this paper we limit ourselves to the above scheme
since it is well suited for feasibility studies).
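
One step of the semi-implicit scheme amounts to one linear solve, because the convection coefficient is frozen at the known time level. The sketch below is a minimal illustration under assumptions that are ours: periodic boundary conditions, centered finite differences instead of the wavelet space approximation, a hypothetical Gaussian bump as initial data, and the grid parameters h = 1, $\Delta t$ = .5, $\nu$ = .5 on 64 points.

```python
import numpy as np

def semi_implicit_step(u, dt, h, nu):
    """One semi-implicit step: (u_new - u)/dt - u * d(u_new)/dx = nu * d2(u_new)/dx2, periodic in x."""
    M = len(u)
    S = np.zeros((M, M))
    for i in range(M):
        left, right = (i - 1) % M, (i + 1) % M
        S[i, i] = 1.0 + 2.0 * nu * dt / h**2
        S[i, left] = -nu * dt / h**2 + dt * u[i] / (2.0 * h)    # coefficient of u_new[i-1]
        S[i, right] = -nu * dt / h**2 - dt * u[i] / (2.0 * h)   # coefficient of u_new[i+1]
    return np.linalg.solve(S, u)

M, h, dt, nu = 64, 1.0, 0.5, 0.5
x = h * np.arange(M)
u = np.exp(-((x - 32.0) / 6.0) ** 2)     # hypothetical smooth bump with values in (0, 1]
u0_min, u0_max = u.min(), u.max()
for _ in range(30):
    u = semi_implicit_step(u, dt, h, nu)
print(u.min(), u.max())
```

With these parameters and |u| ≤ 1 the off-diagonal entries of the step matrix are nonpositive and each row sums to 1, so the discrete solution stays between the minimum and maximum of the initial data. This is one simple way to see the good stability behavior in a finite difference setting; it is not a statement about the wavelet discretization.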

6.2.3  A  Space-Time  Adaptive  Wavelet  Method  for  the  Regularized  Burgers  Equation 

Equation (6.21) is a regularized version of the inviscid Burgers equation obtained by taking $\nu$ = 0.
It is a well known fact that the solutions of the inviscid equation may develop discontinuities (shocks) even if
the initial data $u_0$ is very smooth. Indeed for small values of $\nu$ the solution may develop very strong
gradients making the numerical solution of (6.21),(6.22) a nontrivial problem. Various methods can be
developed to reproduce accurately the fast variation of the solution near the shock. Such methods
include adaptive mesh refinement and entropy control methods (see, e.g., [Tadmor, 1989] and references
therein  for  shock  capturing  techniques).  In  this  section  we  describe  a  space-time  adaptive  wavelet 
method  to  solve  the  regularized  Burgers  equation;  this  method  uses,  at  each  time  step  k,  a  discretization 
of  problem  (6.23),(6.24)  based  on  a  suitably  chosen  subset  Sk  of  scaling  functions  and  wavelets. 
These  scaling  functions  and  wavelets  will  all  belong  to  some  subspace  Vn  (defined  in  previous 
sections)  for  sufficiently  large  n.  The  basic  steps  of  this  adaptive  method  are: 

Adaptive  Method 

(i)  Discretize in space the elliptic problem (6.24) using a (N ≥ 3) wavelet subspace $V_n$ where n
is chosen to be sufficiently large so as to represent accurately $\partial u/\partial x$.

(ii)  At each time $t = k\Delta t$, compute an initial guess wavelet solution $w_n^{k+1} \in V_n$ of problem
(6.21),(6.22) discretized in space using $V_n$ and discretized in time by any
explicit method (e.g. Forward Euler or Lax-Wendroff) from the known solution $u_n^k$.

(iii)  Use the inverse Mallat transform (recursively using equation (1.35)) to calculate coefficients
of the expansion of $w_n^{k+1}$ in terms of a combination of scaling functions and wavelets.

(iv)  Choose a subset $S^k$ of the scaling functions and wavelets appearing in the expansion of
$w_n^{k+1}$ in (iii) so that the corresponding truncated expansion is sufficiently accurate.

(v)  Calculate the solution $u_n^{k+1}$ of Problem (6.23),(6.24) using the approximating subspace
spanned by $S^k$.

This method provides a dynamical in time sequence of Galerkin approximations of the elliptic
problems (6.24) for k ≥ 0.
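
Steps (iii) and (iv) of the adaptive method can be sketched as follows. The block makes several assumptions that should be read as such: it uses the Haar wavelet instead of the N ≥ 3 wavelets of this paper, it selects the subset by keeping the coefficients of largest magnitude (the paper does not prescribe a selection rule here), and the steep front standing in for the initial guess is hypothetical.

```python
import numpy as np

def haar_decompose(c, coarsest_len):
    """Split fine-level scaling coefficients into coarse scaling coefficients and wavelet details."""
    s = np.asarray(c, dtype=float)
    details = []
    while len(s) > coarsest_len:
        even, odd = s[0::2], s[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        s = (even + odd) / np.sqrt(2.0)
    return s, details                      # details[0] is the finest level

def haar_reconstruct(s, details):
    """Invert haar_decompose."""
    s = np.asarray(s, dtype=float)
    for d in reversed(details):
        fine = np.empty(2 * len(s))
        fine[0::2] = (s + d) / np.sqrt(2.0)
        fine[1::2] = (s - d) / np.sqrt(2.0)
        s = fine
    return s

def truncate(s, details, keep):
    """Keep the `keep` largest coefficients (playing the role of the subset S^k), zero the rest."""
    flat = np.concatenate([s] + details)
    keep_index = np.argsort(np.abs(flat))[::-1][:keep]
    mask = np.zeros(len(flat), dtype=bool)
    mask[keep_index] = True
    flat = np.where(mask, flat, 0.0)
    sizes = [len(s)] + [len(d) for d in details]
    parts = np.split(flat, np.cumsum(sizes)[:-1])
    return parts[0], parts[1:]

x = (np.arange(64) + 0.5) / 64
w = np.tanh(40 * (x - 0.5))               # hypothetical steep front standing in for the initial guess
s, details = haar_decompose(w, coarsest_len=8)
err = {}
for keep in (64, 32, 16):
    s_k, d_k = truncate(s, details, keep)
    err[keep] = np.linalg.norm(haar_reconstruct(s_k, d_k) - w)
print(err)
```

Because the Haar transform is orthonormal, the discrete $L^2$ error of the truncated expansion equals the norm of the discarded coefficients, so it cannot increase when more functions are kept.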

6.2.4  Numerical  Experiments 

The first test problem is defined by taking $\nu = 2 \times 10^{-3}$ in (6.21), $u_0(x) = \exp(-8(1-x))$ and

(6.25)  $\partial u(0,t)/\partial x = 0$, $u(1,t) = 1$,

as boundary conditions. To solve the above problem we have combined the semi-implicit
Scheme (6.23),(6.24) with a wavelet discretization based on $V_n$ with N = 3 and n = 6, implying 68
basis functions; the time discretization step is $\Delta t = 10^{-3}$. The discrete Neumann-Dirichlet problems
occurring at each time step have been solved using the Lagrange multiplier technique described in Section
3.2 to treat the Dirichlet condition at x = 1. Figure 6.7 shows the solution at times 0, .03, and .18,
illustrating the development of a quasi shock starting at t = .03 and fully developed at t = .18.

Let us now describe the second test problem.

For computational convenience we have taken 0 < x < 64 and assumed periodic boundary conditions
for u and $\partial u/\partial x$ at x = 0 and x = 64. The viscosity parameter $\nu$ has been chosen equal to .5 and the
initial value $u_0$ is defined by the piecewise polynomial $C^1$ function shown in Figure 6.8. This figure
also  shows  the  solutions  at  5,10,15,20,25,  and  30  time  steps  computed  using  a  finite  difference  in 
space  and  in  time  discretization  with  128  grid  points  (h  =  1/2)  and  a  time  step  At  =  .5.  These  accurate 
solutions  are  used  as  a  reference  to  which  we  compare  our  wavelet  based  solutions. 

Figure  6.9  shows  (upper  left)  the  finite  difference  solutions  at  5,10,15,20,25,  and  30  time  steps 
of this test problem computed using 64 grid points (h = 1) and time step $\Delta t$ = .5; and (upper right) the
corresponding  errors  (compared  against  the  reference  solutions).  Also  shown  (lower  left)  are  the 
(N=3)  wavelet  solutions  of  this  problem  computed  using  n  =  6  (64  basis  functions  -  all  scaling 
functions)  and  a  time  step  At  =  .5;  and  (lower  right)  the  corresponding  errors. 

Figure 6.10 shows two sets of (N=3) wavelet solutions at 5, 10, 15, 20, 25, and 30 time steps of this test
problem, computed using the space-time adaptive
wavelet method described in Section 6.2.3 with n = 6 and a time step $\Delta t$ = .5. The top left shows the
solutions obtained using, for all k, a subset $S^k$ of $V_n$ consisting of 32 scaling functions and wavelets;
and the top right shows the corresponding errors. The bottom left shows the solutions obtained
using, for all k, a subset $S^k$ of $V_n$ consisting of 16 scaling functions and wavelets; and the bottom right
shows the corresponding errors.

Comparing  the  errors  shown  in  Figure  6.9  indicates  that  the  wavelet  discretization  provides 
accuracy  comparable  to  finite  differences  with  the  same  number  of  degrees  of  freedom  if 
approximation by scaling functions in $V_n$ is used. The errors shown in the top right of Figure 6.10
indicate that approximation by a combination of scaling functions and wavelets can achieve the same
accuracy  using  significantly  fewer  degrees  of  freedom.  However,  the  errors  in  the  bottom  right  of 
Figure  6.10  indicate  that  a  significant  reduction  of  accuracy  can  result  if  too  few  degrees  of  freedom 
are  utilized.  We  are  currently  investigating  improved  space-time  adaptive  wavelet  methods. 

7  Linear  Advection  Problem 

This  section  discusses  the  following  initial  value  problem: 


(7.1)  $\partial u/\partial t = \partial u/\partial x$ for t > 0 and $x \in (0,1)$,

(7.2)  $u(1,t) = 0$ for t > 0,

(7.3)  $u(x,0) = u_0(x)$.

7.1  Solution  of  Linear  Advection  Equation 

We  solved  problem  (7.1)  using  first  an  explicit  Lax-Wendroff  time  discretization  to  obtain  the 
following  semi  discrete  problem: 

(7.4)  $u^0 = u_0$,

(7.5)  $u^{k+1} = u^k + \Delta t\,\partial u^k/\partial x + .5\,\Delta t^2\,\partial^2 u^k/\partial x^2$,

followed by the wavelet space discretization to obtain the following scheme

(7.6)  $u_n^0 = u_{0n}$,

(7.7)  $u_n^{k+1} = u_n^k + \Delta t\,A_n u_n^k + .5\,\Delta t^2\,B_n u_n^k$,

where $A_n$, $B_n$ represent the wavelet approximations of the operators $A = \partial/\partial x$ and $B = \partial^2/\partial x^2$ by the
subspace $V_n$. Scheme (7.6),(7.7) is stable for sufficiently small $\Delta t > 0$ (of the order of h).
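
The structure of this Lax-Wendroff scheme is easy to see in a finite difference setting. The sketch below assumes centered differences in place of the wavelet operators and periodic boundary conditions in place of the inflow condition at x = 1; the initial pulse is hypothetical. For $\Delta t = h$ the step reduces to an exact shift of the grid values to the left, which gives a convenient check.

```python
import numpy as np

def lax_wendroff_advection_step(u, dt, h):
    """One step u + dt * D1 u + .5 * dt^2 * D2 u for du/dt = du/dx on a periodic grid."""
    up, um = np.roll(u, -1), np.roll(u, 1)      # u_{i+1} and u_{i-1}
    return u + dt * (up - um) / (2.0 * h) + 0.5 * dt**2 * (up - 2.0 * u + um) / h**2

M = 64
h = 1.0 / M
x = h * np.arange(M)
u0 = np.exp(-200.0 * (x - 0.7) ** 2)            # hypothetical smooth pulse

# dt = h: each step moves the profile one grid cell to the left.
u_shift = u0.copy()
for _ in range(10):
    u_shift = lax_wendroff_advection_step(u_shift, h, h)

# dt = h/2: the discrete L2 norm does not grow.
u_half = u0.copy()
for _ in range(40):
    u_half = lax_wendroff_advection_step(u_half, 0.5 * h, h)

print(np.max(np.abs(u_shift - np.roll(u0, -10))), np.linalg.norm(u_half) / np.linalg.norm(u0))
```

The solution of $\partial u/\partial t = \partial u/\partial x$ is $u_0(x + t)$, so the profile travels toward x = 0, in agreement with prescribing the boundary value at x = 1.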


7.2  Numerical  Experiment 

Scheme (7.6),(7.7) was used to compute the wavelet solution of problem (7.1),(7.3) using N=3,
n  =  6  (68  basis  functions),  and  At  =  .001.  Figure  7.1  shows  the  initial  data  (top)  and  the  solution  at 
times  t  =  .410  (middle)  and  t  =  .820  (bottom). 

8  Further  Comments  and  Conclusion 

In this paper we have explored the potential of wavelet based approximations for the
numerical  solution  of  boundary  and  initial  value  problems  in  one  space  dimension.  For  this  class  of 
problems,  wavelets  compare  favorably  with  finite  element  and  finite  difference  approximation.  Indeed 
it  is  our  opinion  that  wavelets  share  some  of  the  computational  properties  of  finite  element  and  spectral 
methods  and  that  they  are  well  suited  for  multilevel  solution  methods.  The  generalization  to 
multidimensional  problems  is  nontrivial,  particularly  for  curved  boundaries;  we  think  however  that 
combining fictitious domain methods with wavelets may lead to powerful algorithms for
fairly general two and three dimensional domains.

Abstract - This paper deals with systems of linear algebraic equations arising in the
application of the finite element method to elliptic boundary value problems with
singularly perturbed operators. These problems appear when implicit
difference methods are used for solving parabolic equations, including unsteady
convection-diffusion problems. To solve FEM-systems, the paper suggests both iterative
methods with multilevel domain decomposition preconditioners (DD-preconditioners)
and non-iterative DD-methods with overlapping subdomains. The latter methods
exploit the property of fast exponential decay of grid Green's functions of
singularly perturbed elliptic operators. The justification and practical
implementation of the DD-methods suggested are discussed.


1.  INTRODUCTION 

In recent years, the construction, justification and practical implementation of
domain decomposition methods (DD-methods) for solving partial differential
equations have attracted ever-increasing interest among specialists in
numerical mathematics and mathematical modelling. A review of the up-to-date
results obtained in this field can be found in the Proceedings of the 1st and 2nd
International Symposia on Domain Decomposition Methods [5,0]. Great progress
was made in constructing and justifying methods to solve elliptic problems
[2,3,7,8,11,13,14].

This paper suggests two types of algorithms of the DD-method for approximate
realization of elliptic difference schemes for unsteady convection-diffusion
problems.

The first group of algorithms is based on the standard idea of exploiting a
positive definite preconditioner in the iterative procedure. For such a
preconditioner the paper suggests using the multilevel DD-preconditioner for a
symmetric positive definite elliptic operator [10], which is chosen spectrally
equivalent to the non-symmetric coercive operator of the original problem.

The second group of DD-methods is based on a new idea of replacing the grid
problem in the original domain by a series of grid problems for subdomains of a
smaller size [9,12]. These subproblems being solved, a grid function is
constructed which approximates the grid function, solution of the original
problem, with a prescribed accuracy. Section 4 discusses mathematical foundations
for such replacement while Sections 5-7 consider specific versions of the
DD-method with overlapping subdomains and particular features of their
implementation.


2.  PROBLEM  FORMULATION 

Let $\Omega$ be a two-dimensional polygonal domain with the boundary $\partial\Omega$ and $\Gamma_1$ be a
closed subset of $\partial\Omega$, consisting of a finite number of segments of straight lines.
Let us consider the unsteady convection-diffusion problem

$\partial u/\partial t - \nabla\cdot(a\nabla u) + b\cdot\nabla u + cu = f$  in $\Omega \times (0;T]$,
$u = 0$  on $\Gamma_1 \times (0;T]$,
$a\,\partial u/\partial\nu + \sigma u = 0$  on $\Gamma_2 \times (0;T]$,          (2.1)
$u(0) = u^0$  in $\Omega$.

Here, T = const > 0; $\Gamma_2$ is a subset of $\partial\Omega$, consisting of a finite number of open
segments ($\Gamma_1 \cap \Gamma_2 = \emptyset$, $\Gamma_1 \cup \Gamma_2 = \partial\Omega$); $\nu$ is the vector of the external normal to
$\partial\Omega$; $b = (b_1, b_2)$; a, $b_1$, $b_2$, c, f, $\sigma$ and $u^0$ are given bounded piecewise-smooth
functions. It is assumed that $a \ge a_0 = const > 0$ in $\Omega$ and $\sigma \ge 0$ on $\Gamma_2$.

Define $V = \{v : v \in H^1(\Omega),\ v = 0 \text{ on } \Gamma_1\}$ as a subspace of the Sobolev space
$H^1(\Omega)$ with the same norm $\|\cdot\|_V = \|\cdot\|_{H^1(\Omega)}$ and the bilinear form

$a(u,v) = \int_\Omega [a\,\nabla u\cdot\nabla v + (b\cdot\nabla u)v + cuv]\,d\Omega + \int_{\Gamma_2} \sigma uv\,d\Gamma$.  (2.2)

We assume the form a(u,v) to be positive semi-definite on V.

Under the assumptions made, problem (2.1) can be formulated as follows: find
$u = u(t) \in V$ such that $u(0) = u^0$ and for each $t \in (0;T]$ the following
identity is valid:

$(\partial u/\partial t, v) + a(u,v) = (f,v) \quad \forall v \in V$,  (2.3)

where ( , ) is an ordinary scalar product in the space $L_2(\Omega)$.

Let $\Omega_h$ be a triangulation of $\Omega$, such that $\Gamma_1 \cap \bar\Gamma_2$ belongs to the set of
vertices of triangles from $\Omega_h$, and $V_h$ be a standard piecewise-linear finite
element subspace of V [4]. Apply to solving problem (2.3) the simplest implicit
scheme of the first order accuracy in time with the FEM-approximation in spatial
variables. The discrete problem can be formulated as follows: for k = 1,...,m
find functions $u_h^k \in V_h$ such that

$((u_h^k - u_h^{k-1})/\Delta t,\ v_h) + a(u_h^k, v_h) = (f^k, v_h) \quad \forall v_h \in V_h$.  (2.4)

Here, $\Delta t = T/m$, m is a positive integer and $u_h^0$ denotes an approximation of
the function $u^0$.

For the sake of simplicity, assume the function $u^0$ to belong to $V_h$. Then
(2.4) can be replaced with another formulation of the discrete problem, which is
more convenient for what follows: for k = 1,...,m find $w_h^k = u_h^k - u_h^{k-1} \in V_h$
such that

$(w_h^k, v_h) + \Delta t \cdot a(w_h^k, v_h) = -\Delta t \cdot l^k(v_h) \quad \forall v_h \in V_h$,  (2.5)

where

$l^k(v) = a(u_h^{k-1}, v) - (f^k, v)$.  (2.6)

Problem (2.5)-(2.6) for each k ≥ 1 leads to the system of linear algebraic
equations

$Aw = g$  (2.7)

with a positive definite (in the Euclidean space $R^N$) N × N matrix

$A = M + \Delta t \cdot K$  (2.8)

and a vector $g \in R^N$. Here, M is the mass matrix and K is a stiffness matrix
generated by the form a(u,v).

This  paper  is  aimed  at  constructing  methods  of  approximate  solution  of 
system  (2.7)-(2.8)  which  are  based  on  the  domain  decomposition  ideas.  To  this 
end,  we  need  auxiliary  matrices. 

Let us prescribe a symmetric elliptic form

$\tilde a(u,v) = \int_\Omega [\tilde a\,\nabla u\cdot\nabla v + \tilde b\cdot\nabla(uv) + \tilde c\,uv]\,d\Omega + \int_{\Gamma_2} \tilde\sigma uv\,d\Gamma$,  (2.9)

whose coefficients $\tilde a$, $\tilde b_1$, $\tilde b_2$, $\tilde c$ and $\tilde\sigma$ possess the properties similar to those
of the coefficients of the form a(u,v). We assume the form $\tilde a(u,v)$ to be
equivalent to the form a(u,v) in the sense that

$c_1\,\tilde a(v,v) \le a(v,v) \le c_2\,\tilde a(v,v) \quad \forall v \in V$,  (2.10)

where $c_1$ and $c_2$ are positive constants.

Using the relations

$(\tilde K u, v) = \tilde a(u_h, v_h) \quad \forall u_h, v_h \in V_h \ (u, v \in R^N)$  (2.11)

determine a symmetric N × N matrix $\tilde K$ and prescribe the N × N matrix

$\tilde A = M + \Delta t \cdot \tilde K$.  (2.12)

It is obvious that

$c_1(\tilde K v, v) \le (Kv, v) \le c_2(\tilde K v, v) \quad \forall v \in R^N$,  (2.13)

where $c_1$ and $c_2$ are taken from (2.10). This implies in particular that if the
matrix K is positive definite (semi-definite), then the matrix $\tilde K$ will be also
positive definite (semi-definite). In any case, the matrices A and $\tilde A$ are
simultaneously positive definite.

Let  us  formulate  some  statements  whose  proofs  are  directly  implied  by  the 
above  assumptions. 

Statement 2.1. The following inequality is valid:

$\tilde c_1(\tilde A v, v) \le (Av, v) \le \tilde c_2(\tilde A v, v) \quad \forall v \in R^N$,  (2.14)

where $\tilde c_1$ and $\tilde c_2$ are positive constants.

Consider the eigenvalue problem

$Av = \lambda \tilde A v$.  (2.15)

Statement 2.2. The eigenvalues of problem (2.15) belong to the rectangle
$[\tilde c_1; \tilde c_2] \times [-d; d]$ of the complex plane, where $\tilde c_1$ and $\tilde c_2$ are constants from
inequalities (2.14) and d is a positive constant.

Apply to solving system (2.7) the iterative method

$\tilde A(w^j - w^{j-1}) = -\alpha(Aw^{j-1} - g)$,  j = 1,2,...  (2.16)

with a constant parameter $\alpha > 0$. Then Statement 2.2 implies the following

Statement 2.3. There exists $\tilde\alpha = const > 0$ such that for any $\alpha \in (0;\tilde\alpha]$
iterative method (2.16) converges at the rate of geometric progression with the
factor $q = q(\alpha) = const < 1$.

Remark 2.1. Here and henceforth, we mean constants independent of the grid
$\Omega_h$ and of the quantity $\Delta t$. Obviously these constants can depend on the

coefficients  of  the  bilinear  forms,  the  geometry  of  the  domain  O  and  the 
structure  of  the  set  f^. 

Let us prescribe a symmetric positive definite N × N matrix B such that

$\hat c_1(Bv, v) \le (\tilde A v, v) \le \hat c_2(Bv, v) \quad \forall v \in R^N$,  (2.17)

where $\hat c_1$ and $\hat c_2$ are positive constants, and apply to solving system (2.7) the
iterative method

$B(w^j - w^{j-1}) = -\alpha(Aw^{j-1} - g)$,  j = 1,2,...  (2.18)

with a constant parameter $\alpha > 0$.

Under the assumptions made, the following statement is valid.

Statement 2.4. There exists $\hat\alpha = const > 0$ such that for any $\alpha \in (0;\hat\alpha]$
iterative method (2.18) converges at the rate of geometric progression with the
factor $q = q(\alpha) = const < 1$.
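
The preconditioned iteration of this section can be illustrated on a small model problem. The sketch below rests on assumptions that are ours and not the paper's setting: a one-dimensional periodic finite difference convection-diffusion operator replaces the two-dimensional finite element matrices, the mass matrix is lumped to the identity, and the symmetric part of the operator is used directly as the preconditioner.

```python
import numpy as np

def periodic_matrices(M, dt, b):
    """Nonsymmetric A = I + dt (K_sym + C) and its symmetric part A_tilde = I + dt K_sym."""
    h = 1.0 / M
    I = np.eye(M)
    shift_up = np.roll(I, 1, axis=1)       # (shift_up @ v)[i] = v[i+1]
    shift_down = shift_up.T                # (shift_down @ v)[i] = v[i-1]
    K_sym = (2.0 * I - shift_up - shift_down) / h**2     # diffusion part, symmetric
    C = b * (shift_up - shift_down) / (2.0 * h)          # convection part, skew-symmetric
    return I + dt * (K_sym + C), I + dt * K_sym

def preconditioned_iteration(A, B, g, alpha, iterations):
    """Iterate B (w_new - w) = -alpha (A w - g), starting from w = 0."""
    w = np.zeros_like(g)
    for _ in range(iterations):
        w = w - alpha * np.linalg.solve(B, A @ w - g)
    return w

A, A_tilde = periodic_matrices(M=32, dt=0.01, b=1.0)
g = np.sin(2 * np.pi * np.arange(32) / 32)
w = preconditioned_iteration(A, A_tilde, g, alpha=1.0, iterations=30)
residual = np.linalg.norm(A @ w - g)
print(residual)
```

In this model the eigenvalues of the preconditioned matrix have real part 1 and a small imaginary part, so the iteration converges geometrically, in line with the rectangle described in Statement 2.2.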


3.  TWO-LEVEL  DOMAIN  DECOMPOSITION  METHOD 

In this section, we will construct a two-level DD-preconditioner for the matrix
A of system (2.7), which will satisfy conditions (2.17) of Statement 2.4.

To this end, we partition the grid domain $\Omega_h$ into grid subdomains $G_{1,h}$, $G_{2,h}$
and $G_{3,h}$ as shown in Fig. 1. In other words, we assume that $\mathrm{mes}(\partial G_{1,h} \cap \partial G_{3,h}) = 0$
and each subdomain $G_{i,h}$ is partitioned into non overlapping subdomains $G_{i,h}^{(j)}$,
j = 1,...,$m_i$, where $m_i$, i = 1,2,3, are positive integers. Obviously each of
$G_{i,h}$, i = 1,2,3, is a union of triangles from $\Omega_h$.

Figure 1. Partitioning $\Omega_h$ into subdomains $G_{1,h}$, $G_{2,h}$ and $G_{3,h}$.
Let us

partition 

the  grid 

and  °Z.h  ’ 

C3,6 

(see 

Fig. 2) 

subdomains 

°1,6 

and 

°Z,b‘ 

matrices 

°l,b  "  Gl,b  U  °2,b 


1,6  2.6' 


a<2>  2 

*  ** 


nz  y  Az 


generated  by  the  bilinear  form  ( o,  v)  *  4t*a(u,v)  for  the  subdomains  ^  and 

which  are  considered  as  superelements  of  the  grid  domain  O..  It  is  obvious  that 

n 


4  „  to 


A  A(1)M(2)  !  A 

d  =  V  /  V  ,  V 


411  412 


421  422 


Finally,  following  the  standard  technique  of  domain  decomposition  methods  with 
alternating  Neumann-Dirichlet  boundary  conditions  [2,7,11,14]  we  define  the  Afc# 
matrix  C  h ] 


fl  =  A  A  ^A  '  A 

yi  y  yz  z  zy  i  yz 


~Bn'AizAzlAzi  \z 


i s  the  stiffness  matrix  for  the  subdomain  O.  ..  We  call  the  matrix  i  from 

1  i  A 

(3.3) the one-level DD-preconditioner for the matrix A from (3.2).
Continue the procedure of construction of the two-level DD-preconditioner.
To  this  end,  set  m  °i  b  and  ^  ”  2^.  Then  partition  0 ^  Into  two  grid 

subdoaalns  ^  ft  “  ®i  ft  and  ft  “  C2  ft  *8ee  with  the  coaaon  boundary 

4 ' 3Ci.  ^  n  ^  and  define  the  stiffness  Matrices 


11 


A 

A* 

A 

a!2>  V, 

1 

1/ 

and 

A 

A* 

1 

42>  *2 

(3.5) 


generated by the bilinear form $(u, v) + \Delta t \cdot a(u, v)$
for the subdomains and tf

i.b  8, ft 


which are considered as superelements of the grid domain $\Omega_h$. It is obvious that

a 


\  V,  ,  0 

r  ^ 

^  1 

A~  WKW  !  A. 
yi  1  i  |  12 

11  12 

°  \~i  '  y 

"21 

422  J 

Finally,  we  define  the  matrix 


B  = 


A  /S  A-]  A  A 

B11+A12A22A21  A12 


21 


422 


(3.6) 


(3.7) 


as  the  one-level  DD-preconditioner  for  the  Matrix  A  from  (3.6). 

The  resulting  two-level  OO-preconditioner  [10]  for  the  matrix  A  froa  (3.2) 
will  be  definetlby  the  formula 


^~A\2A22A2\  ~A\2 


21 


422 


(3.8) 


The eigenvalues of the matrices $\tilde B^{-1}A$ and $\hat B^{-1}\hat A$ are known to belong to the segments $[1;\tilde d]$ and $[1;\hat d]$, respectively, where $\tilde d$ and $\hat d$ are positive numbers greater than unity. It is easy to show [10] that in this case the eigenvalues of the matrix $B^{-1}A$ belong to the segment $[1;d]$, where $d = \tilde d\,\hat d$. Under certain conditions in which the so-called extension theorem [1,15,16] is valid, the numbers $\tilde d$, $\hat d$ and, hence, $d$ are independent of the grid $\Omega_h$ and the step size $\Delta t$. Now we formulate some requirements imposed on the sizes of subdomains under which these quantities will be independent of the grids.
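
The lower spectral bound in the segment $[1;d]$ can be checked numerically on a small algebraic model. The sketch below is only an illustration under stated assumptions: the superelement matrices are random dense symmetric positive definite matrices (not finite element matrices), the names `random_spd` and `nd_preconditioner_spectrum` are hypothetical, and the block structure follows the form of the Neumann-Dirichlet preconditioner as read from (3.3), with $B_{11}$ playing the role of the stiffness matrix of the first subdomain.

```python
import numpy as np

def random_spd(n, rng):
    """Return a random symmetric positive definite n-by-n matrix."""
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)

def nd_preconditioner_spectrum(n1=6, n2=5, seed=0):
    """Eigenvalues of B^{-1} A for a model Neumann-Dirichlet preconditioner."""
    rng = np.random.default_rng(seed)
    # Assumed stand-ins for the two superelement matrices: B11 acts on the
    # block-1 unknowns only, A2 couples block-1 and block-2 unknowns.
    B11 = random_spd(n1, rng)
    A2 = random_spd(n1 + n2, rng)
    A12, A21, A22 = A2[:n1, n1:], A2[n1:, :n1], A2[n1:, n1:]
    # Assembled matrix: sum of the two superelement contributions.
    A = A2.copy()
    A[:n1, :n1] += B11
    # Preconditioner: same off-diagonal and (2,2) blocks, modified (1,1) block.
    B = A2.copy()
    B[:n1, :n1] = B11 + A12 @ np.linalg.solve(A22, A21)
    # B = L L^T, so B^{-1} A is similar to the symmetric matrix L^{-1} A L^{-T}.
    L = np.linalg.cholesky(B)
    Y = np.linalg.solve(L, A)
    sym = np.linalg.solve(L, Y.T).T
    return np.linalg.eigvalsh(0.5 * (sym + sym.T))

eigenvalues = nd_preconditioner_spectrum()
```

In this model the Schur complement of $B$ is $B_{11}$, while that of $A$ is $B_{11}$ plus the positive definite Schur complement of the second superelement, so no eigenvalue of $B^{-1}A$ falls below one; the size of the upper bound depends on the matrices chosen.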

For each of the simply connected subdomains $G_{i,h}$ we define inscribed and circumscribed circles with the radii $r_{i,h}$ and $R_{i,h}$, respectively. Assume that positive constants $\alpha_1$, $\alpha_2$ and $\theta$ exist such that

$$\alpha_1 (\Delta t)^{\theta} \le r_{i,h} \le R_{i,h} \le \alpha_2 (\Delta t)^{\theta}$$

and for each subdomain there exists its mapping onto the square with the side length $(\Delta t)^{\theta}$, which is given by a function from $H^1$ and has an inverse mapping bounded by a constant.

Under the assumptions made, the following statement is valid.

Statement 3.1. For any $\theta \le 0.5$ the eigenvalues of the matrix $B^{-1}A$ belong to the segment $[1;d]$, where $d$ is a positive constant.

This statement and Statement 2.4 directly imply

Statement 3.2. If for the preconditioner $B$ for the matrix $A$ of system (2.7) we choose the matrix $B$ from (3.8) provided $\theta \le 0.5$, then there exists $\bar\alpha = \mathrm{const} > 0$ such that for any $\alpha \in (0;\bar\alpha]$ iterative method (2.18) converges at the rate of geometric progression with the factor $q = q(\alpha) = \mathrm{const} < 1$.


4.  GRID  GREEN  FUNCTION  ESTIMATE 


Statement 3.2 implies that for approximate solution of system (2.7) by method (2.18) with accuracy $\varepsilon \in (0;1)$ in the sense of the inequality


'•’v"  ‘ 


(4.1) 


being valid, it is sufficient to choose $J = c_0 \ln \varepsilon^{-1}$, where $c_0$ is a positive constant.


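Let $Q_h$ be a subdomain of the mesh domain $\Omega_h$ and supp $g_h \subset Q_h$. Embed $Q_h$ into a subdomain $Q_{h,\varepsilon} \subset \Omega_h$ (see Fig. 4) such that any $x \in \partial Q_{h,\varepsilon} \setminus \partial\Omega_h$ satisfies the inequality

$$\rho(x; Q_h) = \min_{y \in Q_h} |x - y| \ge c\,(\Delta t)^{\theta} \ln \varepsilon^{-1}, \qquad (4.2)$$

where $c$ is a positive constant and $\theta = 0.5$.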


Statement  4.1.  Under  the  assumptions  made,  a  positive  constant  c  exists 

£ 


J Q  i 

such  that  m.  -  0  for  any  *  ■  Q.  _  (see  Fig.*#). 
"  o,C 
Figure 4. Interpretation of the behaviour of the grid Green function.


This  statement  and  inequality  (4.1)  directly  imply  validity  of  the 
inequality 


l»V  ln  .  S  etgj 

L2  (Qh.C]  h  ^(O) 


(4.3) 


from  which  in  case  of  regular  meshes  (A~  AT1'2)  we  have 


'■V*>  ‘ 


(4.4) 


We have thus proved that when the distance from the point of location of the source function increases, the grid Green's function of the operator $M + \Delta t\,K$ decreases as $\exp(-\mathrm{const}\,|y - x|/\sqrt{\Delta t})$. This fact is in complete harmony with the results obtained before for the case of symmetric bilinear forms [9,12], specifically for diffusion problems.
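
The decay rate can be observed directly on a one-dimensional model. The following sketch is an illustration only and makes several assumptions that are not part of the paper: a uniform grid on the unit interval, a lumped mass matrix (so that $M$ is the identity), a pure diffusion stiffness matrix, and the hypothetical names `grid_green_function` and `predicted_decay_ratio`. For this model the grid Green's function decays geometrically, and the ratio of neighbouring values is the smaller root of a quadratic in $r = \Delta t / h^2$.

```python
import numpy as np

def grid_green_function(n=401, dt=1e-4):
    """Solve (I + dt*K) w = e_mid on a uniform 1D grid with zero end values."""
    h = 1.0 / (n + 1)
    r = dt / h**2
    A = (1.0 + 2.0 * r) * np.eye(n) - r * (np.eye(n, k=1) + np.eye(n, k=-1))
    rhs = np.zeros(n)
    rhs[n // 2] = 1.0                      # unit source at the midpoint
    return np.linalg.solve(A, rhs), h, r

def predicted_decay_ratio(r):
    """Smaller root of r*rho**2 - (1 + 2*r)*rho + r = 0."""
    return ((1.0 + 2.0 * r) - np.sqrt(1.0 + 4.0 * r)) / (2.0 * r)

dt = 1e-4
w, h, r = grid_green_function(dt=dt)
mid = len(w) // 2
observed = w[mid + 21] / w[mid + 20]       # ratio of neighbouring grid values
rate_per_length = -np.log(predicted_decay_ratio(r)) / h
```

Here `rate_per_length` multiplied by $\sqrt{\Delta t}$ is close to one, which corresponds to the behaviour $\exp(-\mathrm{const}\,|y - x|/\sqrt{\Delta t})$ with a constant near unity for this particular model.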

In practice, as a rule, one chooses $\varepsilon \sim (\Delta t)^{s}$, where $s \in [1;2]$. In this case, the boundary of the domain $Q_{h,\varepsilon}$ lies away from the boundary of the domain $Q_h$ at the distance of order $(\Delta t)^{1/2}\ln(\Delta t)^{-1}$, which is for small values of $\Delta t$ considerably less than the diameter of the domain $\Omega$.

Note that from the standpoint of the arithmetic cost of realization of one step of method (2.18) it is preferable to choose $\theta = 1/2$, i.e. to choose the greatest value of $\theta$ for which this method has the convergence rate with the factor $q$ independent of the grid. There is no reason for supposing that if the value of $\theta$ is decreased, the convergence rate of method (2.18) can be considerably improved.


5.  NON-ITERATIVE  DD-METHOD  WITH  OVERLAPPING  SUBDOMAINS 

The previous section leads us to a new approach to constructing DD-methods for approximate solution of system (2.8). This approach was announced and justified for diffusion problems in [9,12]. Here, we will consider two versions of this approach under the assumption that the grid $\Omega_h$ is regular.

Let us partition the grid domain $\Omega_h$ into $p > 1$ grid subdomains $G_h^{(1)}, \dots, G_h^{(p)}$ as shown, for example, in Fig. 5 and determine vectors $g^{(l)}$ by restricting linear form (2.6) onto subdomains $G_h^{(l)}$, $l = 1,\dots,p$. In other words, we determine functions $g_h^{(l)}$ such that supp $g_h^{(l)} \subset G_h^{(l)}$, $l = 1,\dots,p$, and $\sum_{l=1}^{p} g_h^{(l)} = g_h$.

The linearity of system (2.8) implies that

$$w = \sum_{l=1}^{p} w^{(l)}, \qquad (5.1)$$

where $w^{(l)}$ are solutions to the systems

$$A w^{(l)} = g^{(l)}, \quad l = 1,\dots,p. \qquad (5.2)$$

Our aim is to compute the vector $w$ with accuracy $\varepsilon$ in the uniform norm, i.e. to find the vector $w_\varepsilon$ satisfying the inequality

$$\|w - w_\varepsilon\|_\infty \le \varepsilon. \qquad (5.3)$$


Fix the value of $l \ge 1$ and embed the subdomain $G_h^{(l)}$ into the grid subdomain $G_{h,\varepsilon}^{(l)}$ such that all points $x \in \partial G_{h,\varepsilon}^{(l)} \setminus \partial\Omega_h$ satisfy the inequality

$$\rho(x; G_h^{(l)}) \ge c\,(\Delta t)^{1/2} \ln\frac{p}{\varepsilon} \qquad (5.4)$$

with a constant $c$ (see Fig. 6). Denote by $A_\varepsilon^{(l)}$ the stiffness matrix for the subdomain $G_{h,\varepsilon}^{(l)}$ considered as a superelement and by $g_\varepsilon^{(l)}$ the restriction of the vector $g^{(l)}$ onto the same subdomain, and consider the system

$$A_\varepsilon^{(l)} w_\varepsilon^{(l)} = g_\varepsilon^{(l)}. \qquad (5.5)$$

Figure 6. Embedding $G_h^{(l)}$ into $G_{h,\varepsilon}^{(l)}$.

Finally, denote by $\tilde w^{(l)}$ the vector on $\Omega_h$ whose restriction onto $G_{h,\varepsilon}^{(l)}$ is equal to the vector $w_\varepsilon^{(l)}$ and onto $\Omega_h \setminus G_{h,\varepsilon}^{(l)}$ is the zero vector.

Statement 5.1. There exists a value of the constant $c$ from inequality (5.4) such that the vector $\tilde w^{(l)}$ constructed above satisfies the inequality

$$\|w^{(l)} - \tilde w^{(l)}\|_\infty \le \frac{\varepsilon}{p}. \qquad (5.6)$$

This implies that the vector

$$w_\varepsilon = \sum_{l=1}^{p} \tilde w^{(l)} \qquad (5.7)$$

satisfies inequality (5.3), i.e. approximates the solution to system (2.8) with accuracy $\varepsilon$ in the uniform norm.
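
The first version of the approach can be sketched on a one-dimensional model problem. Everything specific in the sketch is an assumption made for illustration: a uniform grid with lumped mass and pure diffusion (so the matrix is symmetric, unlike the convection-diffusion case of the paper), local problems posed with zero values outside the extended subdomain, the constant $c = 2$, and the hypothetical names `model_matrix` and `overlapping_dd_solve`. The right-hand side is split into $p$ pieces with disjoint supports, each piece is solved for on its subdomain extended by a layer of width $c\,(\Delta t)^{1/2}\ln(p/\varepsilon)$, and the zero-extended local solutions are summed without any iteration between subdomains.

```python
import numpy as np

def model_matrix(n, dt):
    """1D stand-in for M + dt*K: lumped mass, diffusion, zero end values."""
    h = 1.0 / (n + 1)
    r = dt / h**2
    A = (1.0 + 2.0 * r) * np.eye(n) - r * (np.eye(n, k=1) + np.eye(n, k=-1))
    return A, h

def overlapping_dd_solve(A, g, p, width):
    """Sum of zero-extended local solutions on extended subdomains."""
    n = len(g)
    bounds = np.linspace(0, n, p + 1).astype(int)
    w_eps = np.zeros(n)
    for l in range(p):
        lo, hi = bounds[l], bounds[l + 1]                  # subdomain l
        elo, ehi = max(0, lo - width), min(n, hi + width)  # extended subdomain
        g_loc = np.zeros(ehi - elo)
        g_loc[lo - elo:hi - elo] = g[lo:hi]                # piece supported in subdomain l
        w_loc = np.linalg.solve(A[elo:ehi, elo:ehi], g_loc)
        w_eps[elo:ehi] += w_loc                            # zero extension outside
    return w_eps

n, dt, eps, p, c = 400, 1e-4, 1e-6, 4, 2.0
A, h = model_matrix(n, dt)
g = np.sin(np.linspace(0.0, 3.0, n)) + 1.0
width = int(np.ceil(c * np.sqrt(dt) * np.log(p / eps) / h))
w_exact = np.linalg.solve(A, g)
w_eps = overlapping_dd_solve(A, g, p, width)
error = np.max(np.abs(w_exact - w_eps))
```

The local solves are independent of each other, so the loop over `l` could run in parallel; the overlap width grows only logarithmically as the tolerance is tightened.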

Let us briefly discuss the other version of the approach to constructing DD-methods with overlapping subdomains. Choose a subdomain $G_h$ of the grid domain $\Omega_h$ and formulate the problem: find the components of the vector, the solution to system (2.7), corresponding to this subdomain with accuracy $\varepsilon$. In other words, if we denote by $w_G$ the restriction of the vector $w$ onto the subdomain $G_h$, then it is necessary to find a vector $w_{G,\varepsilon}$ such that the following inequality is valid:

$$\|w_G - w_{G,\varepsilon}\|_\infty \le \varepsilon. \qquad (5.8)$$

To solve the problem formulated, embed the subdomain $G_h$ into the subdomain $G_{h,\varepsilon}$ such that the inequality

$$\rho(x; G_h) \ge c\,(\Delta t)^{1/2} \ln(\varepsilon^{-1}) \qquad (5.9)$$

is valid for any point $x \in \partial G_{h,\varepsilon} \setminus \partial\Omega_h$. Then denote by $A_\varepsilon$ the stiffness matrix for the domain $G_{h,\varepsilon}$ and by $g_\varepsilon$ the restriction of the vector on the right-hand side of system (2.7) onto the same subdomain, and consider the system

$$A_\varepsilon w_\varepsilon = g_\varepsilon. \qquad (5.10)$$

Finally, denote by $w_{G,\varepsilon}$ the restriction of the vector $w_\varepsilon$ onto the subdomain $G_h$.

Statement 5.2. There exists a value of the constant $c$ from inequality (5.9), such that the constructed vector $w_{G,\varepsilon}$ satisfies inequality (5.8).

It is obvious that choosing various subdomains of the domain in a successive or parallel way we can compute all the components of the vector, the solution to the system (2.7), with a given accuracy. When solving problems for subdomains we can use different direct and iterative methods. The estimates of the total computational cost in case of the diffusion problem and rectangular piecewise-uniform meshes were given in [9].


6.  CONVECTION-DIFFUSION  PROBLEMS  FOR  SUBDOMAINS 


In DD-methods with overlapping subdomains considered in the previous section, it is necessary to solve repeatedly convection-diffusion problems in subdomains of a small size. Below, we will note a characteristic property of such problems which enables us to construct very efficient computational algorithms to solve them.

For the sake of simplicity, we confine ourselves to considering the differential problem in the square $G$ with the side length $(\Delta t)^{\theta}$, where $\theta \in (0; 1/2]$. Thus, let us consider the equation

$$w + \Delta t\,[-\nabla\cdot(a\nabla w) + \vec b\cdot\nabla w] = g \ \text{in } G, \qquad w = 0 \ \text{on } \partial G, \qquad (6.1)$$

provided $\operatorname{div} \vec b = 0$.

Map the square $G$ onto the unit square $\tilde G$, i.e. introduce new variables $\tilde x = (\Delta t)^{-\theta} x$. As a result, we arrive at the differential problem

$$w + (\Delta t)^{1-2\theta}\,[-\tilde\nabla\cdot(a\tilde\nabla w) + (\Delta t)^{\theta}\,\vec b\cdot\tilde\nabla w] = g \ \text{in } \tilde G, \qquad w = 0 \ \text{on } \partial\tilde G. \qquad (6.2)$$

This implies that in small subdomains the convection term in equation (6.1) plays a considerably smaller role, relative to the diffusion term, than it does in the original domain.

Apply to solving the system

$$Aw = g \qquad (6.3)$$

arising as a result of the FEM-approximation of the problem (6.1) the iterative method

$$B(w^{j} - w^{j-1}) = -\alpha(Aw^{j-1} - g), \quad j = 1,2,\dots, \qquad (6.4)$$

with a constant parameter $\alpha > 0$ and the matrix $B = \tfrac{1}{2}(A + A^T)$.

Statement 6.1. Under the assumption made, there exists $\bar\alpha = \mathrm{const} > 0$ such that for any $\alpha \in (0;\bar\alpha]$ method (6.4) converges at the rate of geometric progression with the factor $q = c(\Delta t) < 1$, where $c$ is a positive constant.

To solve system (6.3), it is thus possible to suggest different iterative methods with the preconditioners $B$ spectrally equivalent to the matrix $A + A^T$. In this case, the smaller the diameters of the subdomains $G_h$, the faster the convergence rate of the method.
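
Iteration (6.4) with the symmetric part of the matrix as preconditioner can be tried on a small model. The sketch below is an assumption-laden illustration rather than the finite element setting of the paper: it uses a one-dimensional finite-difference discretization on an interval of length $(\Delta t)^{\theta}$, central differences for the convection term (which makes that part of the matrix skew-symmetric), constant coefficients, $\alpha = 1$, and the hypothetical names `convection_diffusion_matrix` and `symmetric_part_iteration`.

```python
import numpy as np

def convection_diffusion_matrix(n, dt, theta=0.5, nu=1.0, b=1.0):
    """1D model of I + dt*(diffusion + convection) on an interval of length dt**theta."""
    side = dt**theta
    h = side / (n + 1)
    eye = np.eye(n)
    up, low = np.eye(n, k=1), np.eye(n, k=-1)
    diffusion = nu / h**2 * (2.0 * eye - up - low)   # symmetric part
    convection = b / (2.0 * h) * (up - low)          # skew-symmetric part
    return eye + dt * (diffusion + convection)

def symmetric_part_iteration(A, g, alpha=1.0, steps=30):
    """Iterate B (w_j - w_{j-1}) = -alpha (A w_{j-1} - g) with B = (A + A^T)/2."""
    B = 0.5 * (A + A.T)
    w = np.zeros_like(g)
    history = []
    for _ in range(steps):
        residual = A @ w - g
        history.append(np.linalg.norm(residual))     # residual before the update
        w = w - alpha * np.linalg.solve(B, residual)
    return w, history

A = convection_diffusion_matrix(n=20, dt=1e-4)
g = np.ones(20)
w, history = symmetric_part_iteration(A, g)
```

With $\alpha = 1$ the residual is multiplied at each step by the product of the skew-symmetric part and $B^{-1}$, so the reduction factor is governed by the size of the convection term relative to $B$; on a small subdomain this ratio is small, in line with the remark above about the diminished role of convection.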


7.  CONCLUSION 

Since  this  paper  does  not  contain  proofs  in  detail  and  considers  only 
two-dimensional  problems,  we  would  like  to  make  some  comments. 

The fact of exponential decay of grid Green's function of the matrix $M + \Delta t\,K$ is known to many scientists, and in a number of cases it can be established analytically or by using obvious arguments [9]. However, insufficient attention has been paid so far to the application of this property. In our view, with the requirement for high precision of numerical computations rising and, hence, with finer spatial grids and smaller values of $\Delta t$ used, the importance of domain decomposition methods will be increasing. As seen from Section 5, these methods are very convenient for implementation on parallel computers.

The results of this paper can be considerably improved if we impose some additional constraints on the forms $a(u, v)$ and the grids. These results can be extended in an obvious way to the three-dimensional case. In Section 3, for example, the two-level DD-preconditioner can be replaced by the three-level one with the domain $\Omega_h$ partitioned into cubes. The implicit scheme of the first order accuracy in time can be replaced by the Crank-Nicolson scheme and all the main results remain valid. All these problems still need to be theoretically and practically investigated in detail.

Abstract. In this paper we very briefly review the Generalized Newton Variational Principle for 3-dimensional quantum mechanical reactive scattering. Then three techniques are described which improve the efficiency of the computations. First we use the fact that the Hamiltonian is Hermitian to reduce the number of integrals computed, and then we exploit the properties of localized basis functions in order to eliminate redundant work in the integral evaluation. In addition we suggest a new type of localized basis function with desirable properties. Finally we show how partitioned matrices can be used with localized basis functions to reduce the amount of work required to handle the complex boundary conditions. The new techniques do not introduce any approximations into the calculations, so they may be used to obtain converged solutions of the Schrödinger equation.

1. Introduction


There are several approaches to solve multidimensional quantum mechanical scattering problems. The most widely studied practical methods in physical chemistry are based on writing the solution of the time-independent Schrödinger equation with nonhomogeneous boundary conditions as an expansion over products of unknown non-square-integrable radial relative translational functions and known square-integrable internal-orbital functions [1,2]. This leads to coupled ordinary differential equations for the radial functions, and these are usually solved by propagation along the radial scattering coordinate [3]. In atomic and chemical physics this is usually called the close coupling method. Each internal-orbital function is called a channel, and the number of coupled equations equals the number of coupled channels. In 1979 a workshop [4] was held comparing most of the available specialized techniques for solving these equations to each other. In addition they were compared to a widely used, state-of-the-art, general-purpose variable-order, variable-stepsize predictor-corrector (PC) algorithm [5]. Interestingly, adding the computer times to solve four test cases, the PC algorithm rated 15th out of 15 schemes tested, with a computer time 19 times greater than the best specialized scheme. This shows the great utility of special methods and the great progress achieved in developing highly accurate specialized techniques for atomic and molecular collisions.

The four trial problems used to compare algorithms in the 1979 workshop involved 15-22 coupled channels [4]. With supercomputers and efficient vectorization [6] and storage management [7] strategies, much larger single-arrangement problems have been treated successfully, e.g., a four-body problem with a realistic potential function and 1358 coupled channels has been solved [8] by the technique discussed above of propagation along a scattering coordinate, and a model problem involving the scattering of an atom by a corrugated crystal surface with 18711 channels has been treated successfully [9] by a time propagation algorithm [10].

Rearrangement collisions, however, pose special difficulties. For example, we will consider a rearrangement collision consisting of the reaction of an atom A with a diatom BC: A scatters onto BC, a rearrangement (chemical reaction) occurs, and AB scatters off C, or AC scatters off B; the solution also contains terms and boundary conditions corresponding to A scattering nonreactively off BC. To define a single propagation coordinate, we must either introduce special coordinates that complicate the differential operators and the boundary conditions [11-19], or introduce nonlocal potential operators that convert the differential equations into integrodifferential equations [20-22]. Non-propagative methods for rearrangement scattering have been developed to avoid these problems, especially for electron scattering (e.g., electron-helium scattering in which the incident electron may exchange with either bound electron of helium). Because of the essentially infinite ratio of nuclear and electronic masses and the simplicity of the Coulomb potential, the coordinates and nonlocal potential operators both become particularly simple for electron scattering. A variety of basis set techniques (nonorthogonal spectral methods) have been developed and successfully applied to such problems [23-29].

Chemical reactions are harder to treat than electron scattering not only because the coordinate transformations and potential functions are more complicated (potential functions for chemical reactions typically involve hundreds of lines of code as contrasted to the simple Coulomb potential that completely suffices for electron-atom scattering in nonrelativistic problems), but also because the de Broglie wavelengths are smaller, i.e., there is more structure in the solutions. In the last few years we have developed a variational nonorthogonal spectral method for chemical reactions [30,31], and we have applied it to obtain converged solutions to the time-independent Schrödinger equation with rearrangement scattering boundary conditions and up to 844 coupled channels [32]. Our method is based on a generalization of a variational principle due originally to Newton [33,34], and it is based on expanding the reactive amplitude density in a nonorthogonal basis set. Other basis set variational methods [35-41] based on the Kohn variational principle [42] and sharing many attractive features in common with our approach have also been proposed recently. All these approaches, as well as new nonvariational basis set techniques [43,44], are very encouraging for improving the computational efficiency of nonorthogonal spectral methods for the quantum dynamics of reactive scattering. The present paper is concerned only with the approach based on the generalized Newton variational principle (GNVP). The theoretical formulation of the method [31] and some computational improvements [42] are presented elsewhere. The present paper, after a brief overview of the working equations, describes three additional computational improvements, including improved use of inherent symmetries and a new more localized basis set.

The final variational equations in the method described here may be obtained in several ways. They were originally derived by applying the GNVP to the problem posed as a set of coupled Fredholm integral equations of the second kind [10,31,45]. (They can also be obtained from the formulation of the problem as coupled integrodifferential equations [20,21,46] or with a scattered wave or outgoing wave variational principle [47-50] based directly on the Schrödinger equation with nonhomogeneous boundary conditions.) There are many general techniques for solving Fredholm integral equations of the second kind in a single variable [51]. Just as the specialized techniques developed in physical chemistry for single-arrangement scattering are much more efficient than general-purpose predictor-corrector algorithms for coupled ordinary differential equations, we believe that the specialized techniques developed for solving the coupled equations describing reactive scattering are also more efficient than general techniques developed previously for solving coupled integral equations.

Spectral techniques are now widely employed for solutions of problems in fluid dynamics [52-55]. The choice of basis functions in these applications is very critical, and the same is true for treating reactive scattering. One possibility is to choose basis functions to minimize the number of nonzero weights in the quadratures leading to a given matrix element; this kind of consideration leads to using, e.g., Lobatto functions [40]. Another possibility is to choose the basis functions to allow more efficient quadratures by using fast Fourier transforms or fast cosine transforms; the latter can be accomplished, e.g., by using Chebyshev basis functions [55]. A third possibility is to use basis functions that reduce the number of integrals and the time to solve the final coupled linear equations; this has been our philosophy so far. In the present work we discuss a new consideration, namely a choice of basis function that minimizes the calculational effort in obtaining the half-integrated Green's functions that enter the GNVP and in the integrations over these functions.

2.  Theory 

First we outline the formalism of the basic calculations, and then the improvements are described in detail.

2.1.  General  Equations 

We consider atom-diatom reactive scattering with three arrangements: $\alpha = 1$ denoting A + BC, $\alpha = 2$ denoting B + AC, and $\alpha = 3$ denoting C + AB. The formalism and notation are the same as used previously [31,42]. The Schrödinger equation is

$$(H - E)\Psi^{n_0} = 0 \qquad (1)$$

where $H$ is the Hamiltonian, $E$ is the total energy, and $\Psi^{n_0}$ is the wave function with complex nonhomogeneous boundary conditions corresponding to an incoming wave in channel $n_0$ and outgoing waves in the other channels. The wave function determines the scattering matrix, a complex matrix of scattering amplitudes from which all physical observables of the scattering process may be calculated. Although the boundary conditions on the final solution are complex, to correspond to the physical conditions, we form the solution to the problem in such a way that most of the computations involve real quantities. The channel label $n$ is a collective index denoting arrangement $\alpha$ and internal quantum numbers specifying a channel in that arrangement. The initial channel and initial arrangement (or, sometimes below, any special channel and its corresponding arrangement) are denoted $n_0$ and $\alpha_0$, respectively. We use conservation of total angular momentum $J$, parity, and arrangement symmetry, if present, to block diagonalize the problem, and all further considerations refer to a single block of $N$ coupled channels.

Equation (1) is rewritten in three different ways ($\alpha = 1,2,3$) as

$$(H^D_\alpha + V^C_\alpha - E)\Psi^{n_0} = 0 \qquad (2)$$

where $H^D_\alpha$ is called the distortion Hamiltonian. It contains the kinetic energy and a part of the potential that only couples channels in the same arrangement (with subblocks called distortion blocks), and $V^C_\alpha$ contains the rest of the potential. First we define the regular solutions $\psi^{\alpha_0 n_0}$ of

$$(H^D_{\alpha_0} - E)\psi^{\alpha_0 n_0} = 0 \qquad (3)$$

for various possible initial channels $n_0$ and the principal value Green's functions defined by

$$G^D_\alpha = \mathcal{P}(E - H^D_\alpha)^{-1} \qquad (4)$$

for all three arrangements. These are called the distorted waves and the distorted wave Green's functions. Then we apply the GNVP to solve for the remaining coupling due to the $V^C_\alpha$ potentials. This is accomplished by expanding the reactive amplitude density [31,45] in a square-integrable ($\mathcal{L}^2$) basis of functions $\Phi_\beta$, with $\beta = 1,2,\dots,M$. Each basis function is a product of a radial translational function $t_{m_\beta n_\beta}(R_{\alpha_\beta})$, where $R_{\alpha_\beta}$ is the radial translational coordinate in arrangement $\alpha_\beta$, and an internal-orbital function $\phi_{n_\beta}$ corresponding to channel $n_\beta$ in this arrangement. The GNVP then leads to a matrix equation for the scattering matrix in which the matrix elements are integrals over $V^C_\alpha$, $G^D_\alpha$, and $G^D_\alpha V^C_\alpha G^D_\alpha$, sandwiched between the various $\psi^{\alpha n}$ and $\Phi_\beta$.

An  important  computational  aspect  of  the  resulting  equations  is  that  every 
Green’s  function  always  appears  in  an  integrand  multiplied  by  a  basis  function  in 
the  same  arrangement.  Thus  we  never  compute  the  Green’s  functions  themselves, 
but  only  a  set  of  integrals  over  these  functions.  These  integrals  are  called  half- 
integrated  Green’s  functions  (HIGFs).  The  radial  distorted  waves  and  radial  HIGFs 
are  computed  by  a  finite  difference  boundary  value  method  (FDBVM)  [45,56],  using 
an  irregular  mesh  that  contains  Gauss-Legendre  quadrature  nodes  for  all  further 
quadratures  of  these  quantities  as  a  sub-mesh  [31].  This  avoids  the  difficulties 
[57,58]  of  integrating  over  functions  with  discontinuous  slopes. 
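
The idea of computing half-integrated Green's functions without ever tabulating the Green's function itself can be illustrated on a simple model. The sketch below is not the FDBVM of Refs. [45,56]: it assumes a uniform mesh instead of an irregular one, a model operator $-d^2/dR^2 + \kappa^2$ with zero boundary values in place of the distortion problem, a Gaussian as a hypothetical localized basis function, and the hypothetical names `model_operator`, `higf_via_full_green` and `higf_via_boundary_value_solve`. It compares two routes to the same quantity: tabulating $G(R, R')$ on the mesh and integrating over $R'$, versus solving one inhomogeneous boundary value problem per basis function.

```python
import numpy as np

def model_operator(n=200, length=10.0, kappa=1.0):
    """Finite-difference matrix for -d^2/dR^2 + kappa^2 with zero boundary values."""
    h = length / (n + 1)
    R = h * np.arange(1, n + 1)
    second_difference = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L = second_difference / h**2 + kappa**2 * np.eye(n)
    return L, R, h

def higf_via_full_green(L, phi, h):
    """Tabulate the whole Green's function G(R, R'), then integrate over R'."""
    G = np.linalg.inv(L) / h          # grid values G(R_i, R_j)
    return (G @ phi) * h              # rectangle-rule quadrature over R'

def higf_via_boundary_value_solve(L, phi):
    """Solve one inhomogeneous boundary value problem for the integrated function."""
    return np.linalg.solve(L, phi)

L, R, h = model_operator()
phi = np.exp(-(R - 4.0) ** 2)         # hypothetical localized radial basis function
g_full = higf_via_full_green(L, phi, h)
g_direct = higf_via_boundary_value_solve(L, phi)
```

The two results agree on the mesh, but the second route needs one linear solve per basis function and never stores the two-variable function, which also sidesteps quadrature over the slope discontinuity of $G$ at $R = R'$.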

Then the scattering matrix is given by [42]

$$S_{nn_0} = \delta_{\alpha_n\alpha_{n_0}}\,{}^0S_{nn_0} + \mathcal{S}_{nn_0} \qquad (5)$$

where ${}^0S$ is the scattering matrix for the distortion potential, $\mathcal{S}$ is the correction produced by the remainder of the potential, and $\alpha_n$ is the arrangement associated with channel $n$. (We ignore the notational complications due to the presence of closed channels, for these details play no important role in the subsequent discussion.) The correction to the scattering matrix due to the coupling potential is given by

|  =  |s  +  BrC-1B,  (6) 

where  the  matrices  in  Eq.  (6)  are  given  by 

£b  =  AT£BA,  (7) 

B  =  (B  +  XT£B)A,  (8) 

and 

C  =  C  -  BX  -  Xr|  +  Xr2  -  XT£BX.  (9) 

A  is  the  transformation  which  takes  the  regular  distorted  waves  from  real  standing 
wave  boundary  conditions  to  complex  scattering  matrix  boundary  conditions,  X  is 
the  transformation  required  to  form  the  outgoing  wave  HIGF  given  the  real  function, 
and 

/  dRa.  £  A°:nFnn.(Ra. ) ).  (10) 

J  n' 

Ben.  =  f  dRa,  £  K:nCf>n'(Ra. )  (r)fX.(Ra.),  (11) 

"Bnff.  —  J  dR0'Tnnt'  (Ra.)tmfcnt'(R°ic  )>  (12) 

Bnff,  —  [  dRg,  ^n'iy^l»'(^°.  )  9n’3.(Bg.  ),  (13) 

*  n' 

and 

=  f  dRa,T0nt'(Ra,)tmt.nf,  (-^o.) 

,  (14) 

-  /  dRa.  £ Krn,Gen'(Ro.)9Z0.(Ra.). 

A  three-body  system  involves  three  internal  coordinates,  and  these  equations  involve 
the  final  (third)  integration  of  the  quadrature  over  these  coordinates.  The  matrix 
element  is  one  if  channel  n  and  channel  n0  are  both  members  of  the  same 

distortion  block  of  arrangement  a0  and  are  zero  otherwise,  >s  the  regular 

radial function for the distortion potential defined by $n$ and $n_0$ and satisfying real boundary conditions (i.e., it is one of the radial relative translational functions obtained by solving a set of close coupling equations for $\psi^{\alpha_0 n_0}$), $\beta_0$ is a specific basis function associated with arrangement $\alpha_0$, and $g_{n\beta}$ is the radial part of the real HIGF associated with the distortion block containing $n$ and basis function $\beta$. The HIGF can be expressed as [31]


9UR°)  =  j  dR,°  9nnf(Ra,K)  (15) 

where $n$ and $\beta$ are both associated with the same arrangement $\alpha$, and the radial Green's function is defined as


gnn'(Ra,R'a)  —  A°n„A°,n„  i 

-If  ' 


(r)/„V<(*a)(‘^n»(*'J 

{i)fZn"(R*)(r)ttn4R'a) 


Ra  <  R'a 
Ra>K  ’ 


(16) 


where ${}^{(i)}f$ is the irregular analog of ${}^{(r)}f$. In addition the integrals $F_{nn_0}$, $G_{\beta n_0}$, $\mathcal{F}_{n\beta_0}$, and $\mathcal{G}_{\beta\beta_0}$ are given by the expressions


/  £„<  Kn'  (r)/"4(*°>n<n. (*».)> 

a  =  a„; 

(17) 

i  f  dRa  A“n,  ^n(RoW:.Z(Ra,Ra.), 

otherwise, 

j  5Zn<  A°°n,g%  /)(Ra,)en'n.(Ra.), 

~  \  JdRoZn'  ±an'n-9Ze(Ra)W™-'(Ra,Rae), 

a  =  a0\ 

otherwise, 

(18) 

(  A“°o  (r)/“;„(fia.  ), 

a  =  0co; 

(19) 

1  fdRa  A“n,  Mftn(Ra)B™-'(Ra,Ra.), 

otherwise, 

=  f  A^nJ^(Ra.), 

a  =  ar0; 

(20) 

|  fdRa  ^2n,  A°tn,g^,g(Ra)B°^(Ra,  Ra.), 

otherwise. 

The  integrals  (17-20)  contain  the  inner  two  quadratures  of  the  three-dimensional 
integration  mentioned  above.  For  integrations  over  functions  defined  in  two  different 
arrangements,  say  a  and  a0,  the  quadratures  are  carried  out  in  a  coordinate  system 
consisting  of  Ra,  Ra,,  and  the  angle  AOQo  between  the  vector  from  the  atom  to 
the  diatom  in  arrangement  a  and  the  analogous  vector  in  a„.  The  inner  loop 
involves  integration  over  Aaa.  and  yields  6“,“*^  or  C“,“*  ,  from  which  we  calculate 
[31].  The  various  middle  loops  are  given  in  Eqs.  (17-20),  and  the  various 
outer  loops  are  indicated  in  Eqs.  (10-14).  Similarly  e„>n.(Ra.)  is  defined  as  a  two- 
dimensioned  integral  over  internal  coordinates  orthogonal  to  Ra.  when  all  functions 
in the integrand are associated with the same arrangement $\alpha_0$. Further details of the quantities in Eqs. (1-20) are not necessary for the discussion in this paper; see Refs. [31,42] for full details.

2.2.  Hermitian  Properties  of  the  Hamiltonian 

Let us consider the integrals given by Eqs. (17-20) for a triatomic system where all arrangements are different. Now the matrices $\mathcal{K}^B$ and $C$ are symmetric while $B$, $\tilde B$, and $\mathcal{D}$ are rectangular. For the symmetric matrices it is necessary to compute just the lower triangle, i.e., we can restrict $\alpha < \alpha_0$; thus only 3 of the 6 possible reactive arrangement pairs are required. At first glance it may seem that for the rectangular matrices it will be necessary to use all 6 of the reactive arrangement pairs, but we now show that this is not so. That is, the rectangular matrices can also be assembled from only the 3 reactive arrangement pairs required for the symmetric matrices.
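
The count of arrangement pairs can be made explicit with a short enumeration. This is only a bookkeeping sketch; the labels 1, 2, 3 follow the arrangement numbering of Section 2.1, and the variable names are hypothetical.

```python
from itertools import combinations, permutations

arrangements = (1, 2, 3)    # A + BC, B + AC, C + AB

# All ordered reactive pairs (alpha, alpha0) with alpha != alpha0.
ordered_pairs = list(permutations(arrangements, 2))

# Pairs kept when the restriction alpha < alpha0 is imposed.
restricted_pairs = list(combinations(arrangements, 2))

# Every ordered pair is either kept or is the reverse of a kept pair.
covered = all(
    pair in restricted_pairs or pair[::-1] in restricted_pairs
    for pair in ordered_pairs
)
```

The enumeration gives 6 ordered pairs and 3 restricted pairs, so the inter-arrangement quadratures are halved whenever the remaining blocks can be obtained from the computed ones by a symmetry relation.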

To  derive  the  relation  we  require,  it  is  useful  to  express  the  matrix  elements  of 
the  rectangular  matrices  in  Dirac  notation  [31]: 


B0n.  =<  *,,\GSuaa.\rl>a°n‘  >=<  >, 

(21) 

Bnt.  =<  rn\Ua-G*>' |*,.  >, 

(22) 

and 

T>n0'=<  </>“"!«“•  1*5.  >, 

(23) 

where 

$U_{\alpha\alpha_0} = -(2\mu/\hbar^2)[H - E + \delta_{\alpha\alpha_0}(E - H^D_\alpha)]$,

(24)

$U_\alpha = -(2\mu/\hbar^2)[H - H^D_\alpha]$,

(25)

and $\mu$ is the reduced mass. Now for inter-arrangement integrals, $U_{\alpha\alpha_0}$ is just
$-(2\mu/\hbar^2)(H - E)$, which is Hermitian; thus, since $B_{\beta n_0}$ is real, it can be written as

B0n.  =<r-n-\Ua.aG2\*e>  . 

(26) 

Now  we  also  know  that  [31] 

W„.aG£  =UaG%  -l-Ma.o 

(27) 

thus  for  inter-arrangement  matrix  elements, 

B0n.  =<  4>a°n’\UaGZ\90  >-<  xl>a'n°\Ua\<t>g  >, 

(28) 

or 

&0no  =  &no0 

(29)

This has two consequences. First of all, Eq. (9) can be rewritten as

C  =  C  -  BX  -  XtBt  +  XTg  -  XT£BX, 

where  2  is  zero  unless  an  intra-arrangement  integral  is  involved,  in  which  case  it 
is  equal  to  2  (see  also  Ref.  [59]).  Thus  there  is  no  need  to  compute  and  store  the 
additional  matrices  S  and  V.  This  means  that  there  are  no  integrals  required  for 
the  complex  scattering  matrix  boundary  conditions  which  are  not  also  needed  for 
calculations  which  employ  real  reactance  matrix  boundary  conditions.  However  this 
is not the only advantage. If in calculating the integrals given by Eqs. (17-20), the
restriction $\alpha < \alpha_0$ is made, then there is enough information to calculate $B_{\beta n_0}$ for
$\alpha < \alpha_0$ and the matrix elements of Eqs. (22) and (23) for $\alpha < \alpha_0$. Then by using Eq.
(29), one can obtain $B_{\beta n_0}$ for $\alpha > \alpha_0$.

Thus it is not necessary to evaluate the integrals (17-20) for $\alpha > \alpha_0$. This
is an important simplification because the calculation of these integrals of Eqs.
(11-14) usually requires a substantial fraction of the whole time taken up by the
integral calculation. Thus for systems having no symmetry, this amounts to almost
a factor of 2 decrease in the work involved, since only 3 out of the 6 possible reactive
arrangement pairs are required. For systems having identical atoms this factor is
smaller: for systems of the type A + B2, i.e., where B is the same kind of atom as
C, it is necessary only to perform 2 of the possible 3 unique reactive arrangement
pairs, while for A + A2 collisions, no savings above those already achieved by using
arrangement symmetry [42] are possible from this technique.
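
The counting argument and the symmetry it rests on can be illustrated with a small numerical sketch. This is only a toy model: the arrangement labels, the random real symmetric matrix standing in for the Hermitian operator $-(2\mu/\hbar^2)(H - E)$, and the random vectors standing in for functions of two different arrangements are all assumptions made for illustration, not quantities from the actual calculation.

```python
import itertools
import numpy as np

# Hypothetical labels for the three arrangements of a triatomic system.
arrangements = [1, 2, 3]

# All ordered reactive pairs (alpha != alpha0) versus the restricted set alpha < alpha0.
all_pairs = list(itertools.permutations(arrangements, 2))
kept_pairs = [(a, a0) for a, a0 in all_pairs if a < a0]

# A random real symmetric matrix stands in for a real Hermitian operator.
rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
U = A + A.T

# Real vectors stand in for functions belonging to two different arrangements.
psi = rng.standard_normal(n)
phi = rng.standard_normal(n)

forward = psi @ U @ phi    # <psi|U|phi>: computed for a pair that is kept
backward = phi @ U @ psi   # <phi|U|psi>: the mirrored pair, never computed explicitly

print(len(all_pairs), len(kept_pairs))
print(np.isclose(forward, backward))
```

Because the mirrored element equals the computed one, only the pairs with $\alpha < \alpha_0$ need to be evaluated in this toy setting.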

2.3.  Localized  Basis  Functions 

In this section we show how the calculations simplify if the translational basis
function is a localized function. Although it is convenient numerically to bypass the
calculation of the irregular function in Eq. (16) and solve for the HIGFs directly, it is
necessary to consider the expressions (15-16) for the HIGF in terms of the regular and
irregular distorted waves and the basis functions in order to see the effect of localization.

In particular suppose that the basis function is zero for $R'_\alpha < R^S_\beta$ and
$R'_\alpha > R^L_\beta$. Then the limits on the integral in Eq. (15) reduce from the original 0
to $\infty$ to the computationally more attractive $R^S_\beta$ to $R^L_\beta$. Thus if $R_\alpha < R^S_\beta$, then
in the integral of Eq. (15) we will have $R_\alpha < R'_\alpha$ for all $R'_\alpha$, and the HIGF will
be equal to a linear combination of the columns of the regular function. Similarly
if $R_\alpha > R^L_\beta$, the HIGF will be equal to a linear combination of the columns of
the irregular function. Now in practice we wish to avoid calculating the irregular
function; however this is not difficult since we only need to know this function at
large distances, and for large distances one of the HIGFs will do just as well. We
will order the basis functions so that $\beta = 1, ..., N$ correspond to the basis functions
which have the smallest values of $R^L_\beta$ for each $n_\beta$. Note: $N$, the number of channels,
is less than $M$, the number of basis functions. Thus we can write

f  <r)fZn.(Ra)d°JgA°n,nt'  Ra  <  R$' 
rfg.{R«)  =  l  £„■  tfn'iR*)^. K'n>.  R°  >  Ri  (31) 

[  ffnff.(Ro)  otherwise, 

where $d^S_{n'\beta}$ and $d^L_{n'\beta}$ are proportionality constants, $g_{nn'}$ is the HIGF having the
smallest value for $R^L_\beta$, and $g^N_{n\beta}$ is the numerically generated function. The
proportionality constants for small $R_\alpha$ can be evaluated via

<%$  =  A“,n  f  dRa  <•>/“  n(R*K,n,(Ra)  (32) 

if the irregular function is known, or more practically by forming the ratio of the
numerically determined HIGF and the regular distorted wave. We form the average
of the radial functions over several distances immediately prior to $R^S_\beta$ and then form
the ratio to obtain $d^S_{n'\beta}$. One difficulty with this procedure arises when distortion
potential blocks contain both open and closed channels. In this case, some of
the columns of the regular function $^{(r)}f$ are so small near the origin that it is not
possible to obtain an accurate inverse; then the procedure fails. This problem can be
avoided by decoupling the open and closed channels.
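
The effect of localization, and the ratio procedure for the proportionality constants, can be sketched with a single-channel toy model. Everything specific here is an assumption for illustration: a free-particle Green's function built from $\sin(kR)$ (regular) and $\cos(kR)$ (irregular), a smooth bump as the localized basis function, and a simple rectangle-rule quadrature. It is not the multichannel distorted-wave calculation described in the text.

```python
import numpy as np

k = 1.3                                  # hypothetical wavenumber
R = np.linspace(0.0, 12.0, 1201)
dR = R[1] - R[0]
RS, RL = 4.0, 6.0                        # basis function vanishes outside [RS, RL]

f = np.sin(k * R)                        # regular function (model)
h = np.cos(k * R)                        # irregular function (model)

# Localized basis function: smooth bump on (RS, RL), exactly zero elsewhere.
t = np.where((R > RS) & (R < RL), np.sin(np.pi * (R - RS) / (RL - RS)) ** 2, 0.0)

# Model Green's function G(R, R') = -f(R_<) h(R_>) / k tabulated on the grid.
R_less = np.minimum.outer(R, R)
R_greater = np.maximum.outer(R, R)
G = -np.sin(k * R_less) * np.cos(k * R_greater) / k

# Half-integrated Green's function g(R) = integral of G(R, R') t(R') dR'.
g = G @ t * dR

inner = (R > 0.5) & (R < RS) & (np.abs(f) > 0.2)   # stay away from nodes of f
outer = (R > RL) & (np.abs(h) > 0.2)               # stay away from nodes of h
dS = g[inner] / f[inner]                 # ratio to the regular function for R < RS
dL = g[outer] / h[outer]                 # ratio to the irregular function for R > RL
print(dS.std(), dL.std())
```

In this model both ratios are constant to rounding error, which is the single-channel analog of forming a numerical ratio to obtain the proportionality constants.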

The proportionality constant for large $R_\alpha$ can be determined by a numerical
ratio in a similar manner to the procedure for the small-$R_\alpha$ constant or by solving

(33) 

n# 

where 

d?«  =  K,n  I  dR„  {r)K,n(Ra)tam,n,(Ra),  (34) 

and  d°,n  corresponds  to  the  integral  for  the  basis  function  having  the  smallest  value 
for $R^L_\beta$. The integrals in Eq. (34) are also required for the large-$R_\alpha$ boundary
conditions for the HIGFs [31].

Equation (31) has several consequences. First of all, it is clear that once the
regular function and the HIGFs with the smallest values of $R^L_\beta$ are known, it is only
necessary to calculate and store the remaining HIGFs in the ranges $R^S_\beta$ to $R^L_\beta$. In
practice, since we determine the HIGFs by solving an inhomogeneous form of the finite
difference boundary value method [31,42] (FDBVM), we can reduce the finite difference
grid used for the regular function and the first HIGF, which spans the full radial range,
to a grid which starts at the same first point and goes to just beyond the maximum value
of $R^L_\beta$. Because the last point of the full grid is determined by the distance
where the potential becomes negligible, whereas the maximum value of $R^L_\beta$ is determined
by the distance where the difference between the potential and the distortion
potential becomes negligible, this can result in a considerable savings. The most
important consequence of Eq. (31), however, arises from applying it to the integrals
in Sect. 2.1.

There  are  two  classes  of  integrals  to  consider.  First  of  all  there  are  the  radial 
integrals  with  one  HIGF  and  no  basis  functions,  which  we  will  approximate  by  a 
quadrature  sum: 

Nro 

=  E  (35) 

1=1 

where 

MZn  =  E  ^n'nt9%(Rci)Mn'n(Rai),  (36) 

n* 

wf  being  a  quadrature  weight  and  A/„„>  some  matrix  function.  The  integrals  falling 
in  this  case  are  Ggn.,  T$nc,  (for  which  the  transpose  of  Eq.  (35)  and  following 
are  used),  the  intra-arrangement  parts  of  Bgn ,,  and  the  inter-arrangement  parts 
of  Cgga.  It  is  useful  to  define  a  quantity  analogous  to  M|n,  called  M„n,  which 
differs  by  replacing  the  HIGF  with  the  regular  function.  Then  applying  Eq.  (31), 
the  integral  becomes 


/j„=EA^£(E  <.(*)]+  E  MJ»(o 

,=,i+i 

N*9 

+  EA&Mdtfl  E 

h  i=»f  + 1 

where $R_{\alpha i^S_\beta}$ is the largest quadrature point less than $R^S_\beta$ and $R_{\alpha i^L_\beta}$ is the largest
quadrature point less than $R^L_\beta$. (For the HIGF with the smallest $R^L_\beta$, we set $i^L_\beta$ equal
to $N_{RQ}$.) It should be noted that the quantities in the first sum must be calculated
anyway — for  the  list  of  integrals  mentioned  after  Eq.  (36),  these  correspond  to 
T„„,,  £®„n,  £„!>,>  and  BgUo,  respectively.  Thus  we  see  that  the  radial  quadrature 
points fall into three regions: those less than all $i^S_\beta$, where only quantities involving
the regular function need be accumulated; those greater than all $R^L_\beta$, where only
quantities involving the regular function and the HIGFs with the smallest value of
$i^L_\beta$ need to be accumulated; and points in the intermediate region, where the sums
will include some number of HIGFs less than the full set.

Now consider the case when there are two HIGFs. Here the integral can be
written

Ih.  =  £  (38) 

1=1 

where 


=  »f  £  K-n'K-n'.tt^RaiWn w(Rai)iS»fi.(Rai).  (39) 


The integrals falling in this case are the intra-arrangement parts of $C$. The presence
of two HIGFs greatly complicates this case compared to Eq. (37). There are now
six cases to consider: nonoverlapping basis functions, one basis function contained
within the other, and other overlapping basis functions, with the bra or ket starting
first. However it can be shown that the result of using Eq. (31) is


>  *1 


J  E*A*.,.3fc(« iwi 

E,^^R,(.D 
,  J  E*  Ufh(^  R<J)  -  ,  *f  ,*fo  )]d?£„  i%  >  *£, 

\  Ea  A?n/?^[f?^(lV««)  -  /^(max(»£,t£,ifj))  >  ilf 


+  £  Mli(o 

!Or£i  i  L  s.  *  ^ 


(40) 


In  this  equation,  Ipn  corresponds  to  B,  and  an  argument  of  t  is  interpreted  as  the 
sum  up  to  the  ith  quadrature  point.  It  should  be  noted  that  for  non-overlapping 
basis  functions,  the  middle  sum  will  have  no  contribution.  In  addition,  this  formula 
applies  as  well  to  matrix  elements  consisting  of  one  HIGF  and  one  translational  basis 
function.  In  this  case,  d®£  and  d°jj  for  the  translational  basis  function  are  zero  and 
corresponds  to  J?.  Since  the  number  of  terms  in  the  middle  sum  is  much  less 
than the number of radial quadrature points, $N_{RQ}$, using Eq. (40) greatly reduces
the work required for evaluating the integrals. It should be noted that care needs
to be taken to avoid excessive round-off error when evaluating the last subtraction in
Eq. (40). Optimally one would separately store the contributions from the various
intervals  between  mm(i|,  i£^ )  and  max(i Jj  ,  »£,  t f, )  or  max(i£ ,  if,  if, ),  since  in  this 
way  no  explicit  subtraction  is  required. 
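
The bookkeeping behind the region-split quadrature can be sketched as follows. This is a scalar, single-channel simplification with made-up data: the arrays `f`, `g1`, and `Mfun` are random stand-ins for the regular function, the HIGF with the smallest large-$R_\alpha$ limit, and the matrix function, and the names `iS`, `iL`, `dS`, `dL` are hypothetical labels for the index limits and proportionality constants of one basis function.

```python
import numpy as np

rng = np.random.default_rng(1)
nq = 200                                  # number of radial quadrature points
w = np.full(nq, 0.05)                     # hypothetical quadrature weights
Mfun = rng.standard_normal(nq)            # stand-in for the matrix function
f = rng.standard_normal(nq)               # stand-in for the regular function
g1 = rng.standard_normal(nq)              # stand-in for the first HIGF

# Prefix sums shared by every basis function: computed once, reused for all beta.
cum_f = np.concatenate(([0.0], np.cumsum(w * f * Mfun)))
cum_g1 = np.concatenate(([0.0], np.cumsum(w * g1 * Mfun)))

def localized_integral(g_mid, iS, iL, dS, dL):
    """Quadrature sum that only touches the stored HIGF on points iS .. iL-1."""
    inner = dS * cum_f[iS]                              # points 0 .. iS-1
    middle = np.sum(w[iS:iL] * g_mid * Mfun[iS:iL])     # intermediate region
    outer = dL * (cum_g1[nq] - cum_g1[iL])              # points iL .. nq-1
    return inner + middle + outer

# Build the full HIGF for one basis function and compare with the brute-force sum.
iS, iL, dS, dL = 60, 90, 0.7, -1.2
g_mid = rng.standard_normal(iL - iS)
g_full = np.concatenate((dS * f[:iS], g_mid, dL * g1[iL:]))
brute = np.sum(w * g_full * Mfun)
fast = localized_integral(g_mid, iS, iL, dS, dL)
print(np.isclose(brute, fast))
```

The `outer` term is formed here as a difference of two prefix sums, which is the kind of subtraction the text warns about; storing the sum for each interval separately would avoid it at the cost of a little more storage.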

Another way to exploit Eq. (31) is to take linear combinations of the HIGFs
and use these as basis functions. Consider the HIGF labeled by $\beta$. We can form
the combination

$g^{\mathcal{L}}_{n\beta}(R_\alpha) = g^{N}_{n\beta}(R_\alpha) - \sum_{n'} g^{N}_{nn'}(R_\alpha)\, d^{L}_{n'\beta}$,  (41)

in which case the analog of Eq. (31) for $g^{\mathcal{L}}_{n\beta}$ becomes

f  (r)fZn'(R°K?f  A“,n?  Ra  <  Rs 

9nfi(Rc)  =  |  o  Ra>R%  (42) 

(  gnfj(Ra)  otherwise, 

where $R^S$ is the smallest value of $R^S_{n'}$, and the coefficient multiplying the regular
function is the new small-$R_\alpha$ proportionality constant. There are two advantages of
this formulation. First of all, when using the $g^{\mathcal{L}}_{n\beta}$, Eqs. (37) and (40) simplify,
because now only the first two terms are present. This is especially important for
Eq. (40), because one now avoids the final subtraction, which is complicated to
implement in a manner which avoids roundoff error. The second advantage arises when
transforming to complex boundary conditions. This will be discussed in the next section.

One disadvantage of using the $g^{\mathcal{L}}_{n\beta}$ is that while previously the first sum in
Eqs. (37) and (40) went from 1 to $i^S_\beta$ or $\min(i^S_\beta, i^S_{\beta_0})$, it now goes from 1 to $i^S$,
where $R_{\alpha i^S}$ is the largest quadrature point less than $R^S$. Since $i^S$ will be smaller
than $i^S_\beta$ or $\min(i^S_\beta, i^S_{\beta_0})$, the overall efficiency is not as great. However, one can
diminish this effect by not using in Eq. (41) the $g^{N}_{nn'}$ which are the HIGFs with the
smallest values for the large-$R_\alpha$ zero limit, but rather using HIGFs which have as large
small-$R_\alpha$ zero limits as possible, subject to the constraint that their large-$R_\alpha$ zero
limit is not larger than $R^L_\beta$. This will maximize $i^S$.

We now consider the effect on the matrix elements when the $g^{\mathcal{L}}_{n\beta}$ are substituted
for the $g^{N}_{n\beta}$ in Eqs. (10-14) and (17-20). Since Eq. (41) can be written as

$g^{\mathcal{L}} = g^{N} L$,  (43)

where the matrix elements of $L$ are the constant factors in Eq. (41), we see that
provided we combine the basis functions in Eq. (14) using the same rule as was used to
produce the $g^{\mathcal{L}}_{n\beta}$, then we obtain

$B^{\mathcal{L}} = L^T B$  (44)

and

$C^{\mathcal{L}} = L^T C L$  (45)

when using the $g^{\mathcal{L}}_{n\beta}$. Provided that $L^{-1}$ exists, it is easy to see that

$B^{\mathcal{L}T} (C^{\mathcal{L}})^{-1} B^{\mathcal{L}} = B^T C^{-1} B$,  (46)

thus it is not necessary to transform back to the $g^{N}_{n\beta}$. In order to ensure that $L^{-1}$
exists, it is necessary to retain at least one $g^{N}_{n\beta}$ per channel.
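
The invariance in Eq. (46) is a purely linear-algebraic statement and can be checked numerically. In the sketch below the matrices are random stand-ins (a symmetric, well-conditioned `C`, a rectangular `B`, and an invertible `L`); none of them comes from an actual scattering calculation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, nch = 8, 3                                   # hypothetical basis size and channel count

A = rng.standard_normal((m, m))
C = A + A.T + 2 * m * np.eye(m)                 # symmetric and well conditioned
B = rng.standard_normal((m, nch))               # rectangular
L = np.eye(m) + 0.1 * rng.standard_normal((m, m))   # transformation, assumed invertible

B_L = L.T @ B                                   # transformed B, as in Eq. (44)
C_L = L.T @ C @ L                               # transformed C, as in Eq. (45)

lhs = B_L.T @ np.linalg.solve(C_L, B_L)         # transformed product
rhs = B.T @ np.linalg.solve(C, B)               # original product
print(np.allclose(lhs, rhs))
```

The factors of $L$ cancel as long as $L$ is invertible, which is why no back-transformation is needed.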




Now  consider  the  transformation  from  real  to  complex  boundary  conditions, 
Eqs.  (8),  (9),  and  (30).  Since  the  are  strictly  zero  beyond  the  distance  where 
the  associated  localized  basis  functions  are  zero,  they  will  be  independent  of  the 
boundary  conditions.  That  is,  the  subblock  of  X  associated  with  g£s  will  be  the 
unit  matrix,  the  subblock  of  2  will  be  the  zero  matrix,  and  the  portions  of  Cc 
which  are  associated  with  two  of  the  g„s  will  be  real  and  the  same  as  Cc.  In  the 
next  section,  we  will  show  how  this  can  be  exploited  to  save  work  in  evaluating  Eq. 
(6). 

Thus we see that substantial work can be saved in evaluating the integrals when
localized basis functions are used. We now consider the choice of such functions.
In our previous work using variational methods, we have used distributed gaussian
functions [30-32,42,59-74] or sine-type functions [31] as a basis. Strictly speaking,
the gaussian functions are not local, since they are zero only asymptotically;
however for practical purposes they differ significantly from zero only in a narrow
region. Thus one procedure would be to set the gaussian to zero whenever it
fell below some fraction of its maximum, say $10^{-14}$. However this procedure introduces
discontinuities into the integrands which can cause slow convergence of the
numerical integrals. Thus we seek a localized function which is continuous and has
continuous derivatives. The function we will use is inspired by the cutoff function
of Ref. [75], and is given by


$f(x) = \begin{cases} \exp\left[-\dfrac{a(x/b)^2}{1-(x/b)^2}\right], & |x| < b \\ 0, & |x| \ge b \end{cases}$


(47) 


We call this function a “cut off gaussian” (COG). It has the property that for small
$x/b$, it behaves like $\exp[-a(x/b)^2]$, so it can be made similar to our previous basis
functions, yet it is localized within $b$ of its center. It should be noted that as
$a \to \infty$ with fixed $a/b^2$, the COG becomes a gaussian.
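
A short sketch can make the stated properties concrete. The functional form used below, $\exp[-a u^2/(1-u^2)]$ with $u = x/b$, is an assumption: it is one form that satisfies the three properties listed above (gaussian-like for small $x/b$, exactly zero beyond $b$, gaussian limit at fixed $a/b^2$), and the function name `cog` is hypothetical.

```python
import numpy as np

def cog(x, a, b):
    """Cut-off gaussian sketch: exp[-a u^2 / (1 - u^2)] with u = x/b for |x| < b, else 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u2 = (x / b) ** 2
    out = np.zeros_like(x)
    inside = u2 < 1.0
    out[inside] = np.exp(-a * u2[inside] / (1.0 - u2[inside]))
    return out

a, b = 4.0, 2.0
x = np.linspace(-3.0, 3.0, 13)
print(cog(x, a, b))
```

With this form the function and all of its derivatives go to zero continuously as $|x| \to b$, so truncating at $b$ introduces no discontinuity into the integrands.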


2.4.  Partitioned  Matrices 


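We will partition the matrices into blocks consisting either of functions which
are localized (the $g^{\mathcal{L}}_{n\beta}$) or those which are delocalized. Thus we write

$B^{\mathcal{L}} = \begin{pmatrix} B_{\mathcal{L}} \\ B_{d} \end{pmatrix}$  (48)

and

$C^{\mathcal{L}} = \begin{pmatrix} C_{\mathcal{L}\mathcal{L}} & C_{\mathcal{L}d} \\ C_{\mathcal{L}d}^T & C_{dd} \end{pmatrix}$,  (49)

where the subscript $\mathcal{L}$ means localized and $d$ delocalized. If we solve the matrix
equation

$K = K^B + B^{\mathcal{L}T} (C^{\mathcal{L}})^{-1} B^{\mathcal{L}}$,  (50)

by blocks, we obtain

$K = K^{Bf} + B^{fT} (C^{f})^{-1} B^{f}$,  (51)

where the folded matrices are given by

$K^{Bf} = K^B + B_{\mathcal{L}}^T C_{\mathcal{L}\mathcal{L}}^{-1} B_{\mathcal{L}}$,  (52)

$B^{f} = B_{d} - C_{\mathcal{L}d}^T C_{\mathcal{L}\mathcal{L}}^{-1} B_{\mathcal{L}}$,  (53)

and

$C^{f} = C_{dd} - C_{\mathcal{L}d}^T C_{\mathcal{L}\mathcal{L}}^{-1} C_{\mathcal{L}d}$.  (54)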

Now  consider  solving  Eq.  (6)  by  the  same  procedure.  The  result  is 

g  =  gBf  +  B/rCr’B/,  (55) 

where the complex folded matrices are obtained from Eqs. (7), (8), and (30) using
the real folded matrices of Eqs. (52-54). A similar procedure was employed in the
context of the Kohn variational principle in Ref. [36].
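
The folding of the localized block into the delocalized one is ordinary block elimination, and it can be checked numerically for the real equation $K = K^B + B^T C^{-1} B$. In the sketch below all matrices are random stand-ins and the block sizes are hypothetical; the sign of the correction to $K^B$ is the one that block elimination of a symmetric $C$ gives.

```python
import numpy as np

rng = np.random.default_rng(3)
nl, nd, nch = 6, 2, 2                    # hypothetical counts: localized, delocalized, channels
n = nl + nd

A = rng.standard_normal((n, n))
C = A + A.T + 2 * n * np.eye(n)          # symmetric and well conditioned
B = rng.standard_normal((n, nch))
S = rng.standard_normal((nch, nch))
KB = S + S.T                             # stand-in for the symmetric Born term

# Partition with the localized functions first.
Bl, Bd = B[:nl], B[nl:]
Cll, Cld, Cdd = C[:nl, :nl], C[:nl, nl:], C[nl:, nl:]

K_direct = KB + B.T @ np.linalg.solve(C, B)

# One solve with the localized block serves all three folded matrices.
X = np.linalg.solve(Cll, np.hstack((Bl, Cld)))
Cll_inv_Bl, Cll_inv_Cld = X[:, :nch], X[:, nch:]

KBf = KB + Bl.T @ Cll_inv_Bl             # folded Born term
Bf = Bd - Cld.T @ Cll_inv_Bl             # folded B
Cf = Cdd - Cld.T @ Cll_inv_Cld           # folded C (Schur complement)

K_folded = KBf + Bf.T @ np.linalg.solve(Cf, Bf)
print(np.allclose(K_direct, K_folded))
```

Only the small folded system, of the dimension of the delocalized block, has to be solved again when the boundary conditions change.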

Several things should be noted concerning the above procedure as it affects the
GNVP calculations. First of all, if the number of localized functions is considerably
larger than the number of delocalized functions, as may often be the case, the work
to produce the folded matrices will be greater than the work to evaluate Eq. (55).
This means that the computational effort involved in a calculation with complex
boundary conditions will be very similar to what would have been required if real
boundary conditions had been used. Also the memory requirements will be similar,
because it will not be necessary to store the imaginary part of C.

3.  Applications  and  Directions  for  Future  Work 

The techniques presented here and previously provide an efficient method for
large-scale quantum mechanical calculations of chemical reaction dynamics based
on the generalized Newton variational principle. Some applications that have been
made include the calculation of converged cross sections for the H + H2 [63] and
D + H2 [32,71] reactions and converged state-selected reactive transition probabilities
for these reactions [30,31,42,59,61,62,65,66,69] and for the O + H2 [59,72] and O + HD
[59,67] reactions. We have also presented converged reactive transition probabilities
for the F + H2 reaction with total angular momentum J = 0-2 [64,68,70,73,74].
We have calculated converged collisional delay times, which require very stable
(numerically differentiable) solutions as a function of energy, for H + H2 with J = 0,
1, and 4 [69], for D + H2 with J = 0 [65], and for F + H2 with J = 0 and 1 [68,70]. An
earlier nonvariational version of the method was used to obtain converged transition
probabilities for the H + H2 [45], D + H2 [45,76], O + H2 [46,77], O + OH [46], and
H + HBr [78] reactions.

For O + H2 we have obtained very well converged results with an average of
as few as three gaussians per channel [72]. A recent study of basis set requirements
for F + H2 showed that excellent convergence can be achieved with 10 gaussians per
channel in the F + H2 arrangement and 18 gaussians per channel in the H + HF
arrangement [73]. Further efficiencies can be achieved by using better basis sets, e.g.,
by basis set contraction [31,74,79,80] or the use of localized basis sets as described
in the present paper. Another promising approach is based on the reinterpretation
of the GNVP using a scattered wave variational principle [47-50]. This allows for
hybrid basis sets which effectively convert some of the integrals over
$G^D_\alpha U_{\alpha\alpha_0} G^D_{\alpha_0}$, as in $C_{\beta\beta_0}$, into simpler energy-independent integrals.
All these approaches are being explored for further applications.

4.  Summary 

We have introduced new techniques to reduce the work required in applying
the generalized Newton variational principle to three-dimensional reactive scattering
calculations. The underlying idea behind these developments is to minimize
redundant work as much as possible. This is accomplished in two ways. First of
all, the fact that the Hamiltonian is Hermitian is used to decrease the number of
inter-arrangement integrals which must be calculated. Even for a system with no
symmetry, e.g., O + HD, this reduces by half the number of two-dimensional integrals
which are performed before the final integration of the three-dimensional
exchange integrals. Secondly, we introduce a localized translational basis set which
need not differ significantly from our previous basis functions and then exploit the
effect this has on the half-integrated Green’s functions to reduce the amount of
work required to calculate these functions, the amount of storage required to save
these functions, the amount of work required for the integrals over these functions,
and the work required for the final linear-equations step when complex boundary
conditions are used.


Software Environments for the Parallel Solution
of Partial Differential Equations

L.  Ridgway  Scott 
the  University  of  Houston 

Most commercial supercomputers are now sold as multi-processor systems, and recent
performance gains have come primarily from the additional processors rather than from
improvement in single-processor performance. New systems based on large numbers of low-cost
VLSI processors have been introduced that offer supercomputer performance at a fraction
of the cost of conventional architectures. However, a major obstacle to achieving either
the overall performance of the multi-processor supercomputers or the cost-effectiveness
of the newer “parallel VLSI supercomputers” is the difficulty of programming them.
Performance gains are also being achieved by the development of complex algorithms such as
multi-grid; however, such algorithms are often avoided even on conventional architectures
due to the difficulty of programming them. We are developing techniques for implementing
highly efficient (and theoretically justified) algorithms for scientific computation on
parallel architectures by combining expertise in both computer science and mathematics. The
objective is to do so in a way that the resulting codes are both efficient on, and portable
among, a wide range of parallel computer architectures.

Most research on parallel programming languages in the past has been devoted to
programming multiple tasks on a single processor, e.g., as must be done in an operating
system for a conventional computer. However, new performance gains are now expected
from parallel-cpu computers that co-operate on a single task. Programming language
techniques for the latter environment are currently being studied by a number of researchers.
We have proposed a language construct that allows code on one process(or) to access
variables explicitly (by name only) that are “stored” in another process(or). Thus it allows
one to program a distributed-memory machine as if it has a common address or name
space. This technique led to significantly shorter development time for parallel codes, as
well as improved portability (implementations were carried out on both the NCUBE and
iPSC) and reliability. We have found it possible to program a large number of diverse
algorithms quickly and to obtain excellent performance. We are extending these techniques
to include the use of advanced programming languages which allow the implementation of
abstract data types. This makes coding and debugging finite element and finite difference
applications much faster and more reliable.


COMPUTATIONAL ASPECTS OF “FAST” PARTICLE SIMULATIONS

CHRISTOPHER  R.  ANDERSON* 


Abstract. In many particle simulations the calculation of the potential (velocity field, force, etc.)
requires $O(N^2)$ operations where $N$ is the number of particles. In this paper we describe the basic ideas
behind three methods which are employed to reduce this operation count to approximately $O(N)$. We
discuss the issue of parameter selection for these methods and present some computational evidence which
demonstrates the importance of making good choices for the method parameters. We conclude with some
opinions about the relative merits of the methods.

1. Introduction. The purpose of this talk is to discuss several methods for reducing
the computational time required to carry out particle simulations. The simulations I have
in mind are those which occur in a wide variety of physical problems - plasma physics,
astronomy, incompressible fluid flows, etc. In rather general terms, in order to compute the
evolution of the particles in these simulations, one is required to compute a potential,
force, or velocity which every particle induces on every other particle. If one computes
this interaction explicitly (using a formula I will give below) then the computational time is
proportional to $N^2$, where $N$ is the number of particles. I shall refer to this explicit
calculation as a “direct” method. There are so-called “fast” methods for which the computational
work is $O(N)$ (or $O(N \log N)$, etc.).

Now the $O(N^2)$, or direct, method is very easy to implement; it takes about fifteen
lines of Fortran code. It is also relatively simple to write the code so that advanced
computational hardware (vector or multiprocessor units) can be used to significantly reduce the
computational time. On the other hand the $O(N)$, or fast, methods are rather difficult to
implement (often several hundred lines of code) and organizing the computation so that
vector or multiprocessors are utilized efficiently is a challenging problem. These fast
methods typically have parameters which must be specified in advance of the computation. A
wrong choice of these parameters can lead to a “fast” method which takes more CPU time
than the direct method. Moreover, there are several fast methods, each with somewhat
different properties, so that the choice of which method to use can be difficult. In spite of
these drawbacks there is still great interest in the $O(N)$ type methods - primarily because
the benefit of using such a scheme (when implemented correctly) can be enormous. As an
example, in a two-dimensional vortex calculation the running time for a direct evaluation
of the velocity field induced by N = 31,000 particles was 360 seconds, while a fast method
took only 4 seconds [3].

This reduction in running time is sufficient to stimulate a bit of interest in these fast
methods, and so I intend to discuss three different fast methods. While these do not exhaust
the set of methods one can choose from, these are methods I am familiar with, and I believe
they form a representative sample of the whole class. While the methods may appear to
be very different, each of them utilizes the same basic principle, and I will “derive” the
methods by discussing the basic principle and indicating how this is then developed into
the method's completed form. This discussion will be rather general, and one should consult
the references for the precise details. The general information is not without value, for it is
through this general understanding that one can appreciate the issues related to parameter
selection and other implementation details. I will show some computational evidence which
shows why one should be concerned with parameter selection and indicate how one might
choose the relevant parameters optimally. Lastly I will discuss some aspects of method
selection. Each of these methods takes considerable time to implement (or to just learn how
to run efficiently) and so the issue of appropriate method selection is important.

2. Fast Methods. The computational task in the particle simulation one performs
depends on the type of simulation - for vortex calculations one computes the velocity
field induced by the particles, for galaxy simulations one computes the force induced by
gravitational attraction, etc. Rather than deal with each, I will discuss fast methods for the
following model problem: Given $N$ charged particles at locations $x_i$ with strengths $\kappa_i$,
the goal is to calculate the potential $\phi$, where $\phi$ is a solution of

(2.1)  $\Delta\phi = \sum_{i=1}^{N} \kappa_i\, \delta(x - x_i)$

and $\delta(x)$ is Dirac’s delta function and $\Delta$ is the Laplacian. (If one wants forces then one
computes the gradient of this solution.) I will work with the two-dimensional case since
the key ideas behind the methods don’t differ much when one goes from two to three
dimensions and the two-dimensional case is easier to describe. There are certain efficiency
issues which change with dimension and I will address these specifically. I am also not
discussing computations in bounded domains. This feature introduces some important
complications, but it does not influence the basic structure of the fast methods, and so I
am assuming the computation is carried out in all of $\mathbb{R}^2$.

Often in simulations one uses smoothed delta functions (or “blobs”) and the model
problem in this case is identical to (2.1) but we solve

(2.2)  $\Delta\phi = \sum_{i=1}^{N} \psi_\epsilon(x - x_i)\, \kappa_i$

where $\psi_\epsilon$ is a smoothed delta function whose support is contained within a disk of
radius $\epsilon$. The choice of $\psi$ and $\epsilon$ is important for a simulation's accuracy, but not
particularly important for the methods used to accelerate the computation of (2.2). I make
the assumption that the blobs have support contained within disks or spheres of radius
$\epsilon$, and so the blob functions cannot be completely general.

The solution $\phi$ of (2.1) is given by

(2.3)  $\phi(x) = \sum_{i=1}^{N} \frac{\kappa_i}{2\pi} \log(|x - x_i|)$

or for (2.2)

(2.4)  $\phi(x) = \sum_{i=1}^{N} \kappa_i\, \Phi_\epsilon(x - x_i)$

where $\Phi_\epsilon$ is the potential induced by a single blob. (This can often be calculated explicitly
by solving $\Delta\Phi_\epsilon = \psi_\epsilon(x)$ using the method of separation of variables.)

From (2.3) or (2.4) it is clear that if we evaluate the solution $\phi$ at each point $x_i$,
$i = 1, ..., N$, then this computation requires $O(N^2)$ operations. For particle simulations, the larger
the value of $N$ the better, and so there is great interest in reducing this operation count.
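
A minimal sketch of the direct method for the point-charge formula (2.3) is given below. It is written in Python rather than Fortran, and it makes one assumption that the formula itself does not state: when the potential is evaluated at a particle location, the singular self-interaction term is skipped. The function name `direct_potential` is hypothetical.

```python
import numpy as np

def direct_potential(x, kappa):
    """phi(x_i) = sum over j != i of kappa_j / (2 pi) * log|x_i - x_j|; O(N^2) work."""
    n = len(kappa)
    phi = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:                   # skip the singular self-interaction
                r = np.linalg.norm(x[i] - x[j])
                phi[i] += kappa[j] / (2.0 * np.pi) * np.log(r)
    return phi

# Two particles a distance 2 apart, with strengths 1 and 3.
x = np.array([[0.0, 0.0], [2.0, 0.0]])
kappa = np.array([1.0, 3.0])
phi = direct_potential(x, kappa)
print(phi)
```

The double loop visits every ordered pair of particles once, which is where the $N^2$ operation count comes from.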

A  common  ingredient  to  all  of  the  fast  methods  which  I  am  going  to  describe  consists 
of  representing  the  potential  induced  by  a  cluster  of  particles  by  a  single  computational 
element.  The  savings  in  computation  time  comes  about  because  this  new  computational 
element  typically  takes  less  work  to  evaluate  than  evaluating  the  potential  directly  for  the 
individual  particles.  The  “cost”  of  this  procedure  is  the  loss  of  accuracy  which  occurs  when 
one  represents  this  potential  by  the  new  computational  element.  The  implementation  of 
this  idea  is  perhaps  most  clearly  seen  in  the  multipole  method  [8].  Consider  a  cluster  of  M 
particles  contained  within  a  disk  of  radius  R.  (See  figure  1).  Outside  the  disk,  the  potential 
is  the  real  part  of  an  analytic  function  and  therefore  can  be  represented  as  the  real  part 
of  a  Laurent  expansion.  (Here  we  are  identifying  the  x-y  plane  with  the  complex  z  plane.) 
Thus,  we  have 


$\phi = \mathrm{Re}\Big(\sum_{i=1}^{M} \frac{\kappa_i}{2\pi} \log(z - z_i)\Big) = \mathrm{Re}\Big(\kappa \log(z - z_0) + \sum_{k=1}^{\infty} \frac{a_k}{(z - z_0)^k}\Big)$

where $z_0$ is the center of the disk containing the particles and $\kappa$ and $a_k$ are coefficients
calculated from the strengths and locations of the particles in the disk. If an evaluation
point is outside a disk of radius 2R then the error in using p terms of this Laurent expansion
is $O((1/2)^p)$. Thus, given some desired accuracy, we select a value of p which assures this
accuracy and use the p-term truncated expansion to represent the potential induced by M
particles, i.e. we use the approximation

$\phi = \mathrm{Re}\Big(\sum_{i=1}^{M} \frac{\kappa_i}{2\pi} \log(z - z_i)\Big) \approx \mathrm{Re}\Big(\kappa \log(z - z_0) + \sum_{k=1}^{p} \frac{a_k}{(z - z_0)^k}\Big)$

This finite Laurent series is the new computational element, and the cost of evaluating
this element is O(p) compared to O(M) which would be required to evaluate the potential
directly. Since the number of particles M does not factor into the error estimate, one obtains
computational speedup if M is much greater than p. There is of course work involved in
constructing the finite expansion - but this only requires O(M) computations and need be
done only once. This work is therefore "shared" when there are multiple evaluations of the
potential.
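
The truncated expansion can be illustrated with a short Python sketch. It is not the paper's code: the function names are hypothetical, and the weights `q[i]` stand for $\kappa_i / 2\pi$. The coefficient formula used, $a_k = -\sum_i q_i (z_i - z_0)^k / k$, follows from expanding $\log(z - z_i) = \log(z - z_0) + \log(1 - (z_i - z_0)/(z - z_0))$ in powers of $(z_i - z_0)/(z - z_0)$.

```python
import cmath
import math


def multipole_coefficients(z_src, q, z0, p):
    """Return (Q, [a_1, ..., a_p]) for sum_i q_i*log(z - z_i) expanded about z0.

    Q is the total weight (the coefficient of the log term) and
    a_k = -sum_i q_i * (z_i - z0)**k / k.
    """
    Q = sum(q)
    a = [-sum(qi * (zi - z0) ** k for zi, qi in zip(z_src, q)) / k
         for k in range(1, p + 1)]
    return Q, a


def eval_multipole(Q, a, z0, z):
    """Evaluate Re(Q*log(z - z0) + sum_k a_k/(z - z0)**k) in O(p) operations."""
    w = z - z0
    total = Q * cmath.log(w)
    wk = w
    for ak in a:
        total += ak / wk
        wk *= w
    return total.real


def eval_direct(z_src, q, z):
    """Evaluate sum_i q_i*log|z - z_i| directly in O(M) operations."""
    return sum(qi * math.log(abs(z - zi)) for zi, qi in zip(z_src, q))
```

Forming the coefficients costs O(Mp) in this naive form, but it is done once per cluster; each later evaluation costs O(p) regardless of M.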

This  idea  of  collapsing  a  cluster  of  particles  to  another  computational  element  which 
is easier to evaluate can take other forms. In particle in cell methods a finite difference
grid of mesh width h covers the cluster of computational particles and any desired
evaluation points. (See figure 2.) One then assigns values of charge density to the grid
nodes to approximate the charge distribution of the computational particles. Typically a
given particle's charge is assigned to a small number of nearby grid nodes. To evaluate the
potential  which  is  induced  by  these  particles  we  solve  a  discrete  Laplace  equation  with  the 
forcing  function  given  by  this  grid  charge  density.  The  result  is  an  approximation  to  the 
potential  at  all  of  the  grid  nodes.  To  evaluate  the  potential  at  points  between  the  grid  nodes 
an interpolation formula is used. Again, the work to evaluate the potential once the grid
charge approximation has been made is independent of the number of particles M. In this
particle in cell approach, the new computational element is the combination of a finite set
of grid values (used to represent the charge density on the grid) and an inversion of a discrete
Laplacian followed by an appropriate interpolation. (It may seem peculiar to lump the
inversion  of  the  discrete  Laplacian  into  the  element,  but  this  is  appropriate  since  each  term 
in  the  multipole  expansion  really  corresponds  to  inverting  a  Laplacian.) 
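
The charge assignment step can be sketched as follows. The text does not say which assignment scheme is meant, so this example assumes bilinear ("cloud-in-cell") weights, one common choice in which each particle's charge is shared among the four nodes of the cell containing it. The function name, the grid origin at (0, 0), and the requirement that every particle lies strictly inside the grid are all assumptions of this sketch.

```python
import math


def assign_charge_cic(points, strengths, h, nx, ny):
    """Assign particle charges to an nx-by-ny node grid with bilinear weights.

    Node (i, j) sits at (i*h, j*h). Each particle must satisfy
    0 <= x < (nx-1)*h and 0 <= y < (ny-1)*h (assumption of this sketch).
    Returns rho[i][j], a charge density (charge per cell area h*h).
    """
    rho = [[0.0] * ny for _ in range(nx)]
    for (x, y), kappa in zip(points, strengths):
        i = int(math.floor(x / h))          # lower-left node of the cell
        j = int(math.floor(y / h))
        fx = x / h - i                      # fractional position in the cell
        fy = y / h - j
        for di, wx in ((0, 1.0 - fx), (1, fx)):
            for dj, wy in ((0, 1.0 - fy), (1, fy)):
                rho[i + di][j + dj] += kappa * wx * wy / (h * h)
    return rho
```

Because the four weights sum to one, the total charge on the grid equals the total particle charge. The resulting array would then serve as the right hand side of the discrete Laplace equation.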

There  is  the  question  of  accuracy,  and  this  is  what  distinguishes  the  two  particle  in  cell 
techniques  I  will  discuss.  In  the  first  type  of  method  one  selects  an  assignment  scheme, 
a  discrete  Laplace  approximation,  and  an  interpolation  scheme  and  this  determines  the 
method.  Charge  densities  are  transferred  to  a  grid,  a  discrete  Laplace  equation  is  solved,  and 
then  the  potential  is  interpolated  at  the  desired  evaluation  points.  There  is  no  restriction  as 
to  where  the  evaluation  point  is  with  regard  to  the  particles  which  give  rise  to  the  potential. 
There  are  numerous  choices  of  each  of  the  components,  essentially  leading  to  a  hierarchy  of 
methods with increasing complexity and providing increasing accuracy. I am not going to
go into detail about the different possibilities (one can find many of them described in [10]),
but merely note that all of these procedures have an accuracy which is closely related to
the size of the finite difference grid which is used to cover the computational domain. The
finer the computational mesh used, the more accurate the results. I shall refer to this type of
method as a PIC (particle in cell) method.

Another type of method [1] is similar to the type described above, but with two distinct
differences.  The  first  difference  is  that  attention  is  paid  to  the  location  of  the  evaluation 
point  with  regard  to  the  particles  which  give  rise  to  the  potential.  If  the  evaluation  point  is 
too  close  to  the  source  particles,  then  the  potential  from  the  grid  is  not  used  and  a  direct 
evaluation  is  performed.  The  motivation  for  this  comes  from  the  observation  that  the  error 
in  the  potential  of  the  particle  in  cell  procedure  is  greatest  near  the  source  particles  and 
diminishes rapidly away from the source particles. The second difference is the use of a method
for  assigning  charges  to  the  grid  so  that  the  potential  which  is  obtained  after  the  inversion 
of  the  discrete  Laplacian  is  not  coupled  to  the  order  of  accuracy  of  the  discrete  Laplacian. 
(Essentially  the  assignment  scheme  consists  of  figuring  out  what  should  be  placed  on  the  grid 
so  that  after  the  inversion  of  the  discrete  Laplacian,  the  potential  values  have  an  accuracy 
which  is  independent  of  the  mesh  size).  These  changes  lead  to  a  method  for  which  the 
accuracy  is  relatively  insensitive  to  the  grid  size  and  is  determined  by  other  considerations  - 
for  example  the  size  of  the  region  in  which  the  direct  evaluation  is  used  as  well  as  parameters 
associated  with  the  charge  assignment  procedure.  Since  the  method  involves  correcting  the 
potential  in  regions  localized  about  the  source  points,  I  will  refer  to  this  technique  as  the 
method of local corrections (MLC). (One should note that the evaluation of the potential
from grid values is only done at points away from the source points, a region in which
the potential is analytic, so that high order interpolation formulas can be used which take
advantage of this fact [1], [12].)

I  have  described  the  basic  idea  behind  three  fast  methods  -  the  multipole  method  uses 
truncated  Laurent  expansions  and  the  PIC  and  MLC  methods  use  potentials  induced  by 
grid  based  values,  to  approximate  the  potential  due  to  a  cluster  of  particles.  Of  course, 
there  is  a  bit  of  work  in  creating  a  complete  method  from  this  basic  idea.  For  the  particle 
in  cell  approaches  it  is  not  difficult  to  see  how  a  complete  method  is  formulated.  Since 
the  problem  of  finding  the  potential  is  linear,  one  considers  all  particles  together,  assigns 
their  charge  (mass,  vorticity,  etc.)  to  the  grid,  solves  one  discrete  Laplacian  and  then 
interpolates  the  resulting  potential  at  all  of  the  required  evaluation  points.  If  there  are 
local  corrections  to  be  done  (MLC  method),  then  for  each  evaluation  point  the  value  of 
the potential induced by nearby particles is computed by a formula of type (2.3) or (2.4).
This value is added to the potential at the evaluation point and a value corresponding to
the nearby particles' contribution to the interpolated grid potential is subtracted from the
potential at the evaluation point.

Efficiently implementing the concept behind the multipole method is a little more difficult.
There are several ways one can exploit the finite Laurent expansions, and I will just
discuss one of them, that due to Greengard and Rokhlin [8]. To use the idea effectively a
box is chosen which covers the computational domain. By recursively dividing the box by
factors of two, a nested set of grids is constructed. One selects a finest level of refinement
m so that there are $2^m$ boxes in each direction at that level. For a given set of evaluation
points in one of these finest level boxes, the potential at these points is formed from a direct
interaction with particles in the nearest boxes added to the potential due to multipole
expansions  associated  with  the  particles  in  a  hierarchy  of  surrounding  boxes.  The  further 
away  one  is  from  the  evaluation  points,  the  larger  the  box  used  to  construct  the  multipole 
expansion.  One  wants  to  use  multipole  expansions  with  the  largest  box  possible,  but  there 
is  the  constraint  that  the  disk  which  covers  the  source  particles  be  a  sufficient  distance 
from  the  evaluation  point  box  to  retain  accuracy.  As  an  example  of  the  multipole  structure 
which is used, see figure 3. In this figure the hatched circle indicates the box with which the
evaluation points are associated and the other circles represent multipole expansions.
The  center  of  the  circle  is  the  center  of  the  multipole,  and  all  the  particles  in  the  box  in 
which  the  circle  sits  are  used  to  form  the  particular  multipole  expansion.  It  is  clear  that 
as  the  evaluation  points  change  the  set  of  multipoles  which  are  used  to  approximate  the 
potential  changes.  However,  what  doesn’t  change  is  the  fact  that  some  subset  of  multipoles 
for  small  boxes,  the  next  larger  boxes,  etc.  is  always  used.  Thus,  in  the  implementation  one 
first  constructs  all  of  the  multipoles  associated  with  all  the  different  size  boxes.  The  next 
step  is  to  be  clever  in  the  evaluation  of  the  appropriate  members  of  this  hierarchy.  An  idea 
introduced by Greengard and Rokhlin is to have an equivalent hierarchy of power series
expansions  to  accomplish  this.  Rather  than  go  into  details  about  how  this  is  done  (see  [2] 
or  [8]),  it  is  instructive  to  consider  how  the  potential  of  one  particle  is  “communicated”  to 
an  evaluation  point  via  this  network  of  multipoles  and  power  series  expansions. 

The  source  particle  in  figure  4  is  denoted  by  “x”  and  the  evaluation  point  is  denoted 
by  an  “o”.  The  existence  of  the  source  particle  causes  multipole  expansions  to  be  formed 
for  every  box  which  contains  that  particle  at  all  the  successively  coarser  levels.  Each  of 
these  multipole  expansions  is  indicated  by  a  disk.  These  expansions  are  formed  recursively, 
and  so  the  potential  that  the  source  particle  induces  is  first  represented  by  the  multipole 
expansion  on  the  finest  level.  The  potential  induced  by  this  expansion  is  then  in  turn 
represented by the multipole expansion on the next coarser level, etc. Next, the potential
induced by the coarse level multipole expansion is represented by a coarse level power series
expansion which is associated with the box containing the evaluation point. (The largest
shaded disk in figure 4.) The potential induced by this power series expansion is then in turn
represented by a power series expansion at the next finer level, etc., until the potential is
finally  represented  by  a  power  series  expansion  associated  with  the  finest  level.  This  finest 
level  expansion  is  then  evaluated  to  obtain  the  potential  induced  by  the  source  particle  at 
the  evaluation  point.  Every  source  particle  “talks”  to  the  evaluation  points  through  this 
hierarchy of multipoles and power series expansions. The number of levels which are used
in this communication process depends on the distance between the particles - the further
apart the source and evaluation point are, the more levels are used. An important
point to note is that this computational structure utilizes elements which are localized
about the computational particles themselves. In this communication process, multipoles
and power series expansions associated with boxes not directly above any source particles
or evaluation points are not used. This is a bit different from the particle in cell methods
in which the communication network, a finite difference grid, covers the whole domain and
is not localized about the particles.
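
The recursive formation of coarser level multipole expansions described above relies on re-centering an expansion from a child box center to its parent box center. The following Python sketch shows one standard way to do this for the two dimensional logarithmic kernel; it is an illustration under stated assumptions rather than the implementation of [2] or [8]. The function names are hypothetical, and the weights `q[i]` stand for $\kappa_i / 2\pi$. The shift formula used is $b_l = -Q s^l / l + \sum_{k=1}^{l} a_k s^{l-k} \binom{l-1}{k-1}$ with $s = z_{child} - z_{parent}$.

```python
from math import comb


def multipole_coefficients(z_src, q, z0, p):
    """Return (Q, [a_1..a_p]) with a_k = -sum_i q_i*(z_i - z0)**k / k."""
    Q = sum(q)
    a = [-sum(qi * (zi - z0) ** k for zi, qi in zip(z_src, q)) / k
         for k in range(1, p + 1)]
    return Q, a


def shift_multipole(Q, a, z_child, z_parent):
    """Re-center Q*log(z - z_child) + sum_k a_k/(z - z_child)**k about z_parent.

    Returns (Q, [b_1..b_p]); the first p shifted coefficients depend only on
    the first p original coefficients, so no extra truncation error is made
    in the coefficients themselves.
    """
    s = z_child - z_parent
    p = len(a)
    b = []
    for l in range(1, p + 1):
        term = -Q * s ** l / l
        for k in range(1, l + 1):
            term += a[k - 1] * s ** (l - k) * comb(l - 1, k - 1)
        b.append(term)
    return Q, b
```

With such a shift, the expansion for a parent box can be assembled from the expansions of its four children in $O(p^2)$ operations per child, without revisiting the individual particles.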

3. Parameter Selection. The operation count of each of the methods described
above is O(N) for the multipole method and O(N) + O(M log M) for grid based methods in
which an M x M grid is used. Thus, in terms of this type of asymptotic operation count,
as N is changed the distinction between the methods is covered up by the "O" symbol. A
comparison  in  efficiency  must  be  based  on  the  size  of  the  constant  in  the  asymptotic  work 
estimate,  and  this  constant  depends  on  a  whole  host  of  factors.  The  size  of  this  constant 
is  directly  influenced  by  several  of  the  parameters  which  must  be  specified  before  the  fast 
methods  can  be  used.  I  would  now  like  to  discuss  some  of  these  parameters,  and  indicate 
how  they  might  be  optimally  chosen.  It  is  only  after  such  choices  are  made  that  meaningful 
information  about  the  relative  efficiencies  of  the  methods  might  be  determined. 

In  the  standard  PIC  method,  one  must  select  a  charge  assignment  scheme, a  discrete 
Laplace  approximation,  an  interpolation  scheme,  and  a  grid  size.  The  finer  the  grid,  the 
more  work  involved  in  inverting  the  Laplacian.  The  more  accurate  the  interpolation  and 
assignment  schemes,  the  more  work  is  required  to  implement  them.  In  most  calculations  the 
method  of  assignment,  Laplace  inversion,  and  interpolation  is  not  changed,  and  so  one  has 
only  to  specify  the  grid  size  to  be  used  for  any  particular  calculation.  Since  the  accuracy  is 
tied to the size of the grid, accuracy considerations should dictate its choice. Certainly the
coarsest grid should be chosen for which the calculation retains sufficient accuracy. What
constitutes sufficient accuracy is difficult to determine. This is really a question about the sensitivity
of  the  particle  simulation  to  inaccuracies  in  the  potential  (or  velocity,  or  force  etc.),  i.e.  it 
depends  on  the  particular  problem  under  consideration.  With  PIC  methods  it  is  often  said 
that  you  cannot  believe  any  result  about  particle  structure  which  occurs  on  a  sub-grid  scale 
-  you  should  not  use  such  schemes  if  the  phenomenon  depends  critically  on  particle  motion 
on such scales. This is of course rather conservative advice, since it fails to distinguish
between  the  higher  order  and  lower  order  PIC  methods. 

We are fortunate that for some particle based methods, there is a convergence proof
which explicitly accounts for the fact that a PIC method is used [7]. The central idea is to
show that using a PIC method introduces an implicit smoothing. One then shows that the
particle simulation with this smoothing converges to the solution of the equations as the
number of particles increases and the grid size of the PIC method (and hence the smoothing)
tends to zero. The error estimates worked out in this proof may be of use in determining
the  correct  grid  size  one  needs  to  obtain  accurate  solutions.  (These  estimates  can  also 
assist  in  developing  systematic  methods  for  analyzing  charge  assignment  and  interpolation 
schemes.)  The  fact  that  PIC  methods  are  in  some  sense  equivalent  to  smoothed  particle 
schemes  is  interesting,  for  this  indicates  that  if  one  is  using  the  smoothed  particle  approach, 
then  one  might  seriously  consider  a  standard  PIC  method.  One  can  conceivably  obtain 
similar  answers  and  at  the  same  time  have  a  computationally  efficient  method.  Of  course, 
with  a  PIC  method  it  is  difficult  to  control  the  type  of  smoothing  introduced,  but  recent 
work  by  Merriman  [11]  should  be  of  assistance  here.  (The  approach  of  Merriman  is  also 
useful  in  that  it  indicates  how  one  can  accurately  implement  smoothing  in  the  presence  of 
boundaries.) 

The selection of parameters is rather different for the other two methods. In the multipole
method, one chooses the number of terms p in the finite Laurent and power series
expansions. This choice is determined by the desired accuracy. (Again, what this should be
depends on the problem under consideration.) Next, one must select the level of refinement
used when the computational domain is decomposed into boxes. Unlike the standard particle
in cell method, the size of the grid does not influence accuracy. However, the size of the
grid does influence the computational efficiency of the method. In figure 5 we show the CPU
time required by an implementation of the multipole method [2] on a serial machine (SUN
Sparcstation)  for  the  evaluation  of  the  potential  of  N  randomly  distributed  particles.  Each 
of  the  curves  represents  the  computation  performed  with  different  levels  of  refinement.  The 
results  show  that  there  is  a  marked  difference  in  the  efficiency  of  the  method  depending  on 
the  level  of  refinement  used.  For  small  numbers  of  particles,  too  much  refinement  is  clearly 
undesirable.  This  fact  is  expected,  since  if  there  are  too  few  particles  per  box  (say  less  than 
p),  then  we  are  certainly  doing  more  work  by  using  a  p  term  expansion  to  represent  their 
potential. 

Choosing  the  right  level  of  refinement  thus  presents  something  of  a  problem.  What 
makes  the  problem  more  complicated  is  that  the  amount  of  work  depends  on  the  distribution 
of particles. In figure 6 we show the CPU time for various numbers of particles when they
are distributed in a rectangle with a ten to one aspect ratio. One sees that it is indeed distinct from
the case of particles uniformly distributed in the square. This dependence on the particle
distribution is rather discouraging, since it therefore seems difficult to design an automatic
procedure which would minimize the computation time. Fortunately, as discussed in [2],
it is observed that the work for each level can easily be computed in advance and the
level requiring the least work selected. The key idea is to "dry run" the code, and, rather than carry out
all of the required operations, increment counters based on particle density and the number
of terms in the expansions. Using the value of these counters, the running time can be
well estimated. Due to the recursive nature of the method, this is very easy to implement,
and one can quickly obtain very good estimates of the running times of the multipole code
for several refinement levels. With estimates for the work required by the various levels
of  refinement,  one  just  selects  the  level  with  the  least  amount  of  work  and  carries  out  the 
computation.  This  strategy  works  well  and  the  actual  CPU  time  taken  by  the  program 
forms  a  lower  envelope  for  the  timing  curves  in  both  figures  5  and  6. 
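
The "dry run" strategy can be sketched in Python as follows. This is a deliberately crude illustration, not the procedure of [2]: the function names are hypothetical, the particles are assumed to lie in the unit square, and the cost model (in particular the weights `2 * p` per particle and `27 * p * p` per nonempty box) is an assumption chosen only to show the structure of the idea, namely counting operations from the particle distribution instead of performing them.

```python
from collections import Counter


def estimate_work(points, level, p):
    """Estimate the operation count for a refinement into 4**level boxes.

    No potentials are computed; only counters are incremented.
    The cost weights below are illustrative assumptions.
    """
    n_side = 2 ** level
    counts = Counter()
    for x, y in points:                      # bin particles into finest boxes
        i = min(int(x * n_side), n_side - 1)
        j = min(int(y * n_side), n_side - 1)
        counts[(i, j)] += 1
    # forming multipoles and evaluating power series: O(p) per particle
    expansion_work = 2 * p * len(points)
    # expansion-to-expansion conversions: O(p^2) per nonempty box (assumed 27 each)
    translation_work = 27 * p * p * len(counts)
    # direct interactions between each box and its nearest neighbours
    direct_work = 0
    for (i, j), n in counts.items():
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                direct_work += n * counts.get((i + di, j + dj), 0)
    return expansion_work + translation_work + direct_work


def select_level(points, p, levels):
    """Return the refinement level with the smallest estimated work."""
    return min(levels, key=lambda level: estimate_work(points, level, p))
```

Because the estimate depends on the actual box occupancies, it adapts automatically to non-uniform particle distributions such as the ten to one rectangle discussed above.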

The  situation  for  the  particle  in  cell  method  with  local  corrections  is  analogous  to  the 
multipole  code.  A  certain  amount  of  accuracy  is  specified,  and  this  dictates  the  choice 
of  parameters  associated  with  the  charge  assignment  and  interpolation  scheme.  Once  the 
accuracy is specified one must select the size of the grid. In figure 7 we show the results of
timings for different size grids and a given level of accuracy. (This was a velocity calculation
computed on a Cray XMP [3] - timings for a potential calculation would be analogous.)
Again, one should determine the optimal grid size to minimize the computational effort.
As before, this can be done by "dry running" the code and computing timing estimates
based on particle densities. Rather than compute these estimates for all possible grid sizes,
they are computed for three grid sizes and the minimum of the quadratic interpolant is used to
determine the optimal size.
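
The quadratic interpolation step can be made concrete with a small Python sketch. The function name is hypothetical and the fallback behaviour (returning the best sampled grid size when the three estimates do not form a convex parabola) is an assumption of this example, not something stated in the text.

```python
def quadratic_minimizer(x, t):
    """Minimizer of the parabola through (x[0], t[0]), (x[1], t[1]), (x[2], t[2]).

    x -- three distinct grid sizes in increasing order
    t -- the estimated times for those grid sizes
    Falls back to the best sampled grid size if the parabola is not convex.
    """
    x0, x1, x2 = x
    t0, t1, t2 = t
    # second divided difference: positive when the parabola opens upward
    curvature = ((t2 - t1) / (x2 - x1) - (t1 - t0) / (x1 - x0)) / (x2 - x0)
    if curvature <= 0.0:
        return min(zip(t, x))[1]
    num = (x1 - x0) ** 2 * (t1 - t2) - (x1 - x2) ** 2 * (t1 - t0)
    den = (x1 - x0) * (t1 - t2) - (x1 - x2) * (t1 - t0)
    return x1 - 0.5 * num / den
```

In practice the returned value would be rounded to a grid size the Laplace solver supports, which is a further choice left open here.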

If the asymptotic estimates are scaled to represent true CPU time, then the constants
appearing in them will depend on the particular hardware at hand. Certainly, there is
much to be gained if advanced computational hardware can be utilized effectively. I have
not looked into this aspect in great detail (other than implementing the codes I work with on
a vector/multiprocessor machine) but I am aware that it is not a simple matter to optimize
any of these algorithms to take advantage of special hardware. For information about
implementing PIC type methods on parallel and vector processors, you might have a look
at [4] and [5], while for a discussion of some of the difficulties involved in implementing
multipole type methods, one might consult [9] or [6]. One of the problems which arises
in taking advantage of vector/multiprocessor hardware is that the parallel or vectorization
technique must be dynamically determined to reflect the changing nature of the interaction
between the particles.

4. Conclusions. As I said earlier, the three methods I have spoken of by no means
exhaust the set of possible methods, but I think they form a representative sample. One is
of course interested in specific recommendations for method choice. If I were just starting
out, I would use the direct method. It is very easy to program and you can be assured
of the accuracy of the computation. After a true need for a faster method has developed,
the next choice is between a grid based method and a multipole based scheme. For two
dimensions, it is not clear to me that one of the methods is substantially superior to the
other.  It  is  true  that  controlling  the  accuracy  is  easier  when  using  a  multipole  method  -  i.e. 
it  essentially  depends  on  one  parameter,  p,  the  number  of  terms  used  in  the  expansions. 
For  grid  based  schemes,  accuracy  is  determined  by  several  factors  and  it  takes  more  work 
to  find  the  correct  combination  to  achieve  some  level  of  accuracy.  Aside  from  this  accuracy 
consideration any specific choice should probably be determined by other factors - such
as  the  difficulty  of  implementation,  the  ability  to  exploit  specific  computational  hardware, 
the  ease  of  incorporating  boundary  conditions,  etc.  For  three  dimensions,  there  may  be  a 
reason  to  choose  one  method  over  the  other,  but  the  choice  depends  on  particle  distribution. 
If  the  particles  are  uniformly  distributed  then  either  a  multipole  method  or  a  grid  based 
scheme  would  be  acceptable.  If  the  particles  are  not  uniformly  distributed,  then  there 
may be good reason to select a multipole method. As mentioned earlier, the network of
computational elements in a multipole method is localized about the particle locations,
and so, if the particles are distributed in a lower dimensional fashion (e.g. along a line or
surface) then the computational network is also of lower dimension. This can mean great
savings,  both  in  computation  time  and  in  storage.  A  grid  based  method  takes  no  advantage 
of  the  lower  dimensionality  of  the  particle  distribution  -  the  computational  structure  fills  out 
the  complete  box  in  which  the  particles  reside.  This  introduces  a  significant  computational 
overhead  into  the  grid  based  method.  Of  course  if  one  uses  an  adaptive  grid  method,  then 
this  observation  must  be  modified. 
Figure 2

The  potential  induced  by  particles  in  the  disk  of 
radius  R  is  evaluated  by  interpolating  a  solution  of 
the  discrete  Laplacian.  The  right  hand  side  for  the 
discrete  Laplacian  is  constructed  by  assigning  the 
charge  of  the  particles  to  the  finite  difference  mesh. 


Figure  3 


Circles indicate multipole expansions used to compute the
potential at the evaluation points in the shaded circle.


Figure 4


The  multipole  (plain  disks)  and  power  series  expansions 
(dotted  disks)  used  to  communicate  the  potential  induced 
by  the  particle  at  x  to  the  point  at  o. 
Figure 5

CPU time (in seconds) vs. number of vortices for
particles uniformly distributed in a unit box.

Each curve represents computation time for a refinement
of space into 4^level boxes.
Figure 6

CPU time (in seconds) vs. number of vortices for
particles uniformly distributed in a rectangle with an
aspect ratio of 10 to 1. Each curve represents the
refinement of the surrounding square into 4^level boxes.
Figure 7


CPU time (in seconds) vs. size of mesh in each direction.
Different curves correspond to differing numbers of vortices.


SEMICONDUCTOR  MODELLING  VIA 
THE  BOLTZMANN  EQUATION1 


P.  DEGOND2 
F.  DELAURENS2 
FJ.  MUSTIELES2 


"9-th  International  Conference  on  Computing  Methods  in  Applied  Sciences  and  Engineering" 
Paris  (France),  January  29  to  February  2,  1990 


ABSTRACT : This paper is devoted to the presentation of a new numerical method for the simulation of the
Boltzmann Transport Equation of semiconductors, the weighted particle method. A detailed presentation of the method
can be found in [1,2] and its mathematical analysis has been performed in [3]. In this paper, we will describe the kinetic
model of the Boltzmann Equation and present the numerical method that we propose. We deal with two different cases:
first, a homogeneous case, and then an inhomogeneous one, where one has to solve a coupled Boltzmann-Poisson
system. The numerical study of this latter case has been done in [4], and is detailed in [5].


1 .  INTRODUCTION 

Most  of  the  numerical  simulations  of  semiconductor  devices  use  the  drift-diffusion  model  [6,7]. 
This model is based upon Ohm's law (for drift) and Fick's law (for diffusion) and states that the
average velocity of the carriers is proportional to the electric field; the proportionality coefficient is a
field  dependent  mobility.  This  relation  is  obtained  at  equilibrium,  as  a  consequence  of  the  balance 
between  the  free  acceleration  of  the  carriers  and  their  diffusion  by  the  defects  of  the  crystal  lattice.  The 
time  needed  for  this  equilibrium  to  be  reached  is  the  momentum  relaxation  time  (mean  time  between 
collisions),  so  that  Ohm's  law  is  valid  as  long  as  this  relaxation  time  is  shorter  than  the  time  needed 
for  the  carriers  to  cross  the  device.  But,  in  submicron  devices,  some  carriers  have  almost  collisionless 
(or ballistic) flights, and thus the average velocity can be higher than the value predicted by Ohm's law
[6,8]. In fact, the drift-diffusion model does not take into account the main features of transport in
submicron devices [9]: the presence of ballistic carriers, the large proportion of high velocity
("hot") carriers, and the large gradients of carrier density and temperature.

1 This work has been partially supported by the "Centre National d'Etudes des
Télécommunications", under grant n° 878B087 LAB/lCMyTOH, and by the "Direction des
Etudes et Recherches Techniques", under grant n° 87/283.

2 Centre de Mathématiques Appliquées; URAD CNRS n° 756
Ecole Polytechnique - 91128 Palaiseau Cedex - FRANCE

For  the  simulation  of  hot  electron  effects,  a  more  involved  hydrodynamic  model  has  been 
proposed  by  many  authors  [8,10,11].  It  consists  of  conservation  equations  for  the  mass,  momentum 
and  energy,  and  is  deduced  from  the  Boltzmann  Transport  Equation  by  the  moment  method,  under 
the  assumption  that  the  distribution  function  of  the  carriers  is  a  drifted  Maxwellian.  Scattering 
processes  are  accounted  for  by  empirically  defined  relaxation  times  for  momentum  and  energy  at  the 
right  hand  side  of  these  conservation  equations.  Some  other  modifications  [12,13]  incorporate 
thermal effects. Nevertheless, this model hardly describes the ballistic and hot electron effects, and no
completely  satisfactory  hydrodynamic  model  seems  to  be  available  yet. 

The  kinetic  model  (the  Boltzmann  Equation)  then  seems  to  give  the  most  accurate  description  of 
the  physics  attainable  by  numerical  computations.  In  this  paper,  we  will  recall  the  main  features  of  the 
semiconductor  Boltzmann  equation,  and  describe  the  weighted  particle  method.  We  will  provide  the 
results of numerical simulations in two different cases. First, the homogeneous field model provides a
nice framework for the validation of our method, particularly concerning the deterministic treatment of
the collision term [1,2]. The physical situation is that of an infinite sample of semiconductor
embedded in a uniform electric field. This model also provides the stationary velocity and energy as a
function of the applied electric field, as well as the relaxation times for energy and momentum [14],
which are useful in a hydrodynamical model. Second, we turn to the simulation of a one-dimensional
inhomogeneous structure, which requires solving a coupled Boltzmann-Poisson system [4,5]. For
more details about the model and the numerical method in both the homogeneous and inhomogeneous
cases, we refer the reader to [1,2,4,5]. The numerical analysis of the method in the homogeneous
case has been performed in [3]. For a more detailed physical description of the kinetic model, we
refer  the  reader  to  [6,15,16]. 


2. THE SEMICONDUCTOR BOLTZMANN EQUATION

We will suppose that the electrons are the only charge carriers in the device. This is a reasonable
assumption for many N-doped unipolar devices. The Boltzmann equation for the electron distribution
function f(x,k,t) (where x is the position, k the wave-vector and t the time) is written in the
following way :


$$\partial_t f + v(k)\cdot\nabla_x f - \frac{q}{\hbar}\,E(x,t)\cdot\nabla_k f = Q(f)(x,k,t), \qquad x \in \Omega \subset \mathbb{R}^3,\ k \in B \subset \mathbb{R}^3,\ t \ge 0. \qquad (1)$$

In equation (1), $\Omega$ stands for the device geometry and B for the first Brillouin zone, q is the
positive elementary charge and $\hbar$ the reduced Planck constant. The field v(k) is given and provides
the velocity versus wave-vector relationship for an electron in the semiconductor material. Q(f) is the
scattering operator describing the interactions of the electrons with the lattice defects. The electric field
E(x,t) is related to the electron density n(x,t) via Poisson's equation :


$$E(x,t) = -\nabla\phi(x,t) \qquad (2)$$

$$-\Delta\phi(x,t) = \frac{q}{\epsilon}\,\bigl(n_D(x) - n(x,t)\bigr) \qquad (3)$$

$$n(x,t) = \int_B f(x,k,t)\,\rho_s\,dk \qquad (4)$$

$\epsilon$ and $\rho_s$ are respectively the permittivity of the material and the density of states in the k-space, and
have known values ; $n_D(x)$ is a given doping profile resulting from the fabrication technology.

For quantum mechanical reasons, it may happen that one needs to distinguish between several
species of electrons which are characterized by different (effective) masses and thus, different
functions v(k). These different species are called valleys. For instance, in Gallium Arsenide, three
valleys have to be considered, the $\Gamma$, L and X valleys. In this case, the model consists of as many
distribution functions as valleys, each of them solving its own Boltzmann equation. These different
equations are coupled in two ways : first, via an "intervalley" collision operator, of a similar form as
the "intravalley" one (see below), second, via the Poisson equation. We refer to [1,2,5,6,9] for more
details. For the sake of simplicity, we will describe the kinetic model and the weighted particle
method for a single valley model, and thus, for a single distribution function.

Equations (1)-(4) must be supplemented by initial and boundary conditions :

$$f(x,k,0) = f_0(x,k)$$
$$f(0,k,t) = g_0(k,t) \quad \text{for } v(k) \ge 0$$
$$f(L,k,t) = g_L(k,t) \quad \text{for } v(k) \le 0$$
$$\phi(0,t) = \phi_0(t), \qquad \phi(L,t) = \phi_L(t)$$

with $f_0$, $g_0$, $g_L$, $\phi_0$ and $\phi_L$ suitably given.

The integral scattering operator Q(f) is written :

$$Q(f)(x,k,t) = \int_B \bigl[ S(x,k',k)\,f(x,k',t)\,(1 - f(x,k,t)) - S(x,k,k')\,f(x,k,t)\,(1 - f(x,k',t)) \bigr]\,dk' \qquad (7)$$

S(x,k,k') are known transition rates depending upon the physical nature of the involved scattering
processes. The (1-f) factors originate from Pauli's exclusion principle and make Q a nonlinear
operator. For some examples of transition rates, we refer the reader to [2,5,6,17]. The overall
transition rate may be written in the form :

$$S(x,k,k') = \sum \Phi(x,k,k')\,\delta\bigl(\varepsilon(k') - \varepsilon(k) \pm \hbar\omega_p\bigr) \qquad (8)$$

The sum is to be taken over + and -, respectively standing for the emission and the absorption of a
phonon of energy $\hbar\omega_p$ by an electron, and over all the possible scattering mechanisms. The Delta
function accounts for the conservation of the energy of the electron/phonon system during the
collision. The function $\Phi(x,k,k')$ depends upon the nature of the scattering mechanism.

The coupled system of the Boltzmann equation (1) and Poisson's equation (2,3,4) is nonlinear and
induces collision-damped plasma oscillations of high frequency. In practical situations, the
doping profile function $n_D(x)$ has very steep gradients. The overall problem is a high-dimensional stiff
problem.


3. THE NUMERICAL METHOD : GENERAL PRESENTATION

The most widely used numerical method for solving the semiconductor Boltzmann equation is
the Monte-Carlo method (cf. [6] and references therein), although other methods have been used in
particular geometries (cf. Reed's method [18]) or for particular collision operators (see the recent
method developed by Baranger [9] or Kuivalainen and Lindberg [19]). The Monte-Carlo method is
quite noisy and thus, the affordable number of particles is generally not sufficient to get a sharp
resolution of the distribution function by statistical average. The moments of the distribution function
such as the current and energy densities can be recovered with a sharp resolution, but only through
time averages which make the description of the transient regimes difficult. The new method and the
new algorithms which we will describe in this paper are in some sense derived from the Monte-Carlo
method, but are intended to provide a more accurate numerical approximation.

The weighted particle method was first introduced by G.H. Cottet, S. Mas-Gallic and P.A.
Raviart [20,21], for viscous perturbations of the incompressible Euler equation. Then the method was
adapted to the treatment of collision terms in kinetic equations [22]. Its first application to the
semiconductor Boltzmann equation was made in [1,2] and an error analysis relevant to this
particular physical context has been performed in [3]. In the deterministic particle method, the
particles move along the characteristics of the convective (first order differential) part of the equation,
while the collision term is taken into account through the variation of the weights of the particles. The
collision integral is evaluated by a discrete quadrature where the particles themselves play the role of
quadrature points.

The  weighted  particle  method  is  based  upon  the  approximation  of  the  distribution  function  by  a 
sum  of  Delta  measures : 

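$$f(x,k,t) \approx f_h(x,k,t) = \sum_{i=1}^{N} \omega_i(t)\,f_i(t)\,\delta(x - x_i(t)) \otimes \delta(k - k_i(t)) \qquad (9)$$

$x_i(t)$, $k_i(t)$, $f_i(t)$ and $\omega_i(t)$ are respectively the position, the wave-vector, the weight and the control
volume of the particle i ; they evolve in time according to :

$$\frac{dx_i}{dt} = v(k_i), \qquad x_i(0) = x_i^0 \qquad (10\ a)$$

$$\frac{dk_i}{dt} = -\frac{q}{\hbar}\,E_i(t), \qquad k_i(0) = k_i^0 \qquad (10\ b)$$

$$\frac{df_i}{dt} = Q_i(t), \qquad f_i(0) = f_i^0 \qquad (11)$$

$$\omega_i(t) = \omega_i^0 \qquad (12)$$

where $E_i(t)$ and $Q_i(t)$ are respectively the approximations of the electric field and of the collision
operator acting on the i-th particle. The initial $x_i^0$, $k_i^0$, $f_i^0$ and $\omega_i^0$ are chosen so that :

$$f_0(x,k) \approx \sum_{i=1}^{N} \omega_i^0\,f_i^0\,\delta(x - x_i^0) \otimes \delta(k - k_i^0) \qquad (13)$$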


The  time  differential  system  (10,11)  can  be  solved  by  any  classical  scheme.  In  our 
computations,  we  used  either  the  order  2  Adams-Bashforth  scheme  or  a  mixed  Adams-Bashforth  and 
backwards  Euler  scheme. 
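
As an illustration of the time integration, the following minimal sketch applies the second-order Adams-Bashforth scheme to a generic system dy/dt = F(t, y). The names `adams_bashforth2`, `rhs` and `relaxation`, as well as the test problem, are assumptions made for this example. The first step is started with a forward Euler step, which is one common choice and not necessarily the one used in our computations.

```python
import math

def adams_bashforth2(rhs, y0, t0, dt, n_steps):
    """Integrate dy/dt = rhs(t, y) with the 2nd-order Adams-Bashforth scheme.

    y0 is a list of floats (for instance the positions, wave-vectors and
    weights of all particles stacked together). Returns the final state.
    """
    y = list(y0)
    f_prev = rhs(t0, y)
    # Start-up: one forward Euler step (assumption; any one-step scheme works).
    y = [yi + dt * fi for yi, fi in zip(y, f_prev)]
    t = t0 + dt
    for _ in range(n_steps - 1):
        f_curr = rhs(t, y)
        # y_{n+1} = y_n + dt * (3/2 f_n - 1/2 f_{n-1})
        y = [yi + dt * (1.5 * fc - 0.5 * fp)
             for yi, fc, fp in zip(y, f_curr, f_prev)]
        f_prev = f_curr
        t += dt
    return y

# Hypothetical test problem: relaxation df/dt = -(f - 1) with f(0) = 0,
# whose exact solution is f(t) = 1 - exp(-t).
def relaxation(t, y):
    return [-(yi - 1.0) for yi in y]

approx = adams_bashforth2(relaxation, [0.0], 0.0, 0.01, 100)[0]
exact = 1.0 - math.exp(-1.0)
```

In the particle method, `rhs` would return the right hand sides of (10 a), (10 b) and (11) for all particles at once.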

To define $Q_i(t)$, we introduce a cut-off function $\zeta_\alpha(x)$ such that :

$$\zeta_\alpha(x) = \frac{1}{\alpha}\,\zeta\!\left(\frac{x}{\alpha}\right) \ ; \qquad \zeta(x) = \zeta(-x) \qquad (14)$$

where $\zeta$ is a compactly supported function. We delocalize the integral operator Q(f) in position
using this $\zeta_\alpha$ function, and then we perform a numerical quadrature using the particles as quadrature
points. Therefore we let, omitting the t-dependence of $x_i$, $k_i$ and $f_i$ (see [4] for details) :

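$$Q_i(t) = \sum_{j=1}^{N} \bigl[ S_p(x_j,k_j,k_i)\,f_j\,(1 - f_i) - S_p(x_i,k_i,k_j)\,f_i\,(1 - f_j) \bigr]\,\zeta_\alpha(x_j - x_i)\,\omega_j \qquad (15)$$

The transition rates $S_p$ are smooth regularizations of the transition rates S given by (8) ; they are
written :

$$S_p(x,k,k') = \sum \Phi(x,k,k')\,\delta_p\bigl(\varepsilon(k') - \varepsilon(k) \pm \hbar\omega_p\bigr) \qquad (16)$$

where $\delta_p$ is a compactly supported function defined in a similar way as $\zeta_\alpha$. The computation of
$E_i(t)$ will be detailed in section 5.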
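
The quadrature (15) can be sketched as a double loop over the particles. This is a minimal illustration under stated assumptions: the hat-shaped cut-off `hat`, the user-supplied function `rate(x, k_from, k_to)` standing for the regularized transition rate, and the sample data are all hypothetical, and wave-vectors are taken as scalars for brevity.

```python
def hat(x, alpha):
    """Hypothetical cut-off: even hat function of half-width alpha, unit integral."""
    r = abs(x) / alpha
    return (1.0 - r) / alpha if r < 1.0 else 0.0

def collision_terms(x, k, f, w, rate, alpha):
    """Discrete collision operator Q_i for every particle, following (15).

    x, k, f, w : positions, wave-vectors, weights f_i and control volumes
                 omega_i (one entry per particle).
    rate(x, k_from, k_to) : regularized transition rate from k_from to k_to.
    """
    n = len(x)
    q = [0.0] * n
    for i in range(n):
        for j in range(n):
            z = hat(x[j] - x[i], alpha)
            if z == 0.0:
                continue  # particle j lies outside the cut-off support
            gain = rate(x[j], k[j], k[i]) * f[j] * (1.0 - f[i])
            loss = rate(x[i], k[i], k[j]) * f[i] * (1.0 - f[j])
            q[i] += (gain - loss) * z * w[j]
    return q

# Hypothetical data: four particles and a rate depending on the initial state.
xs = [0.0, 0.1, 0.25, 0.4]
ks = [-1.0, -0.3, 0.4, 1.0]
fs = [0.1, 0.5, 0.3, 0.2]
ws = [0.2, 0.3, 0.1, 0.4]
qs = collision_terms(xs, ks, fs, ws, lambda x, ka, kb: 1.0 + 0.5 * ka, 0.3)
total = sum(wi * qi for wi, qi in zip(ws, qs))
```

Because the cut-off is even, the gain of particle i from particle j cancels the loss of j towards i in the weighted sum, so `total` vanishes up to rounding: the discrete operator conserves the number of electrons.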


4. THE HOMOGENEOUS CASE

Throughout this paragraph, we will suppose that the electric field is an external electric field
denoted by E which is uniform (in space) and constant (in time). We assume that the problem is
homogeneous in space and that $n_D(x) = n_D$ is independent of x. Thus, the dependence upon x
vanishes as well as the coupling with the Poisson equation. We assume the axisymmetry of the wave-vectors
with respect to the field axis, and describe a wave-vector k by its two components $k_1$ and
$k_2$, respectively parallel and perpendicular to the field. Equations (9), (10 b), (11), (12), (13) and (15) describe the
weighted particle method in this particular case.

Figure 1 shows a comparison between our method and the Monte-Carlo method [17], for a
homogeneous sample of Gallium Arsenide doped at $n_D = 5\cdot10^{15}$ impurities per cm$^3$, at temperature
T = 77 K, embedded in a constant electric field E = 10 kV/cm. The band diagram of Gallium
Arsenide was described by a standard three valley model, and the integral operator was of the form
(7,8), with about 40 different scattering mechanisms [2]. Figure 1 displays the mean velocity, mean
energy and density versus time and the results show a very good agreement between our results and
the Monte-Carlo method. Figure 2 displays the three dimensional views of the distribution function,
during its time evolution. With the homogeneous model we can compute the stationary characteristics
of Gallium Arsenide [14]. Figure 3 shows the stationary mean velocity and mean energy versus
electric field and figure 4, the momentum and energy relaxation times versus energy. This
information is useful for hydrodynamic simulations. Finally, we mention that the homogeneous
field model can also be used to describe the bidimensional transport of electrons parallel to a
heterojunction interface [23].
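
Quantities such as the density, mean velocity and mean energy follow from the particle representation (9) as weighted sums. The sketch below is only an illustration: the function `moments`, the scalar wave-vectors and the parabolic band in units where $\hbar$ and the effective mass equal 1 are assumptions of this example, not the three valley band model used for the figures.

```python
def moments(k, f, w, velocity, energy):
    """Density, mean velocity and mean energy of a weighted particle set.

    k, f, w  : wave-vectors, weights f_i and control volumes omega_i
    velocity : function k -> v(k)
    energy   : function k -> energy of the state k
    """
    density = sum(wi * fi for wi, fi in zip(w, f))
    mean_v = sum(wi * fi * velocity(ki) for ki, fi, wi in zip(k, f, w)) / density
    mean_e = sum(wi * fi * energy(ki) for ki, fi, wi in zip(k, f, w)) / density
    return density, mean_v, mean_e

# Hypothetical parabolic band: v(k) = k and energy k^2 / 2 in reduced units.
dens, v_mean, e_mean = moments(
    [-1.0, 1.0, 2.0], [0.5, 0.5, 0.5], [1.0, 1.0, 2.0],
    velocity=lambda kk: kk,
    energy=lambda kk: 0.5 * kk * kk,
)
```

Evaluating these sums at each time step gives curves of the kind displayed in Figure 1 without any time averaging.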


5.  AN  INHOMOGENEOUS  CASE 

In  this  paragraph,  we  concentrate  on  a  one-dimensional  inhomogeneous  case,  for  which  the 
solution  of  the  coupled  Boltzmann-Poisson  system  is  necessary.  More  details  can  be  found  in  [4,5]. 
In  this  inhomogeneous  case,  the  model  consists  of  the  Boltzmann  equation  (1)  and  of  the  Poisson 
equation  (2,3,4).  The  mutual  Coulomb  interaction  between  charged  particles  is  fully  taken  into 
account.  The  electric  field  has  a  constant  direction,  and  we  use  an  axisymmetric  geometry  relative  to 
its  axis,  like  in  the  homogeneous  case.  We  used  the  following  doping  profile,  already  used  by 
Baranger  in  [9] : 


$$n_D(x) = \begin{cases} N^+ & \text{for } 0 \le x \le x_1 \\ N^- & \text{for } x_1 < x < x_2 \\ N^+ & \text{for } x_2 \le x \le L \end{cases}$$

with $N^- = 2\cdot10^{15}$ cm$^{-3}$, $N^+ = 10^{18}$ cm$^{-3}$, L = 1.2 $\mu$m, $x_1$ = 0.4 $\mu$m and $x_2$ = 0.8 $\mu$m. The
behaviour of this N+-N-N+ structure is dominated by the dynamics of the carriers in the N- region.
However, a sharp numerical description of this region is not easy because, due to the large
inhomogeneities ($N^+/N^-$ = 500), the numerical errors on the electron density in the N+ region are of
the same order of magnitude as the density itself in the N- region. Moreover, if the trajectories of the
particles are not accurately solved, the fast particles may jump over the peaks of the electric field
which stand near the N+-N and N-N+ junctions, instead of "seeing" them.

Again, equations (9) to (15) describe the weighted particle method. We only need to detail the
computation of $E_i(t)$ in (10 b) using the Poisson equation. We considered two methods. The first is the
classical "Particle In Cell" (PIC) method [24,25], in which one introduces a mesh of equally spaced
points, and an interpolation function. An assignment procedure using this function gives the
approximation of the electronic density at the grid points. The Poisson equation is then solved with a
finite difference scheme, to get the approximation of the electric field on the grid points. The field is
finally interpolated at the location of the i-th particle, to obtain the value of $E_i(t)$.
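
The four PIC stages described above (assignment, Poisson solve, field on the grid, interpolation) can be sketched as follows. This is a minimal one-dimensional illustration, not the implementation of [24,25]: the function `pic_field`, the linear (cloud-in-cell) weighting, the Dirichlet data on the potential, the Thomas algorithm and the reduced units with q/epsilon = 1 are all assumptions of this example.

```python
def pic_field(xp, charge, n_dop, length, m, phi0, phi_l, q_over_eps=1.0):
    """Electric field at the particle positions by a 1D Particle-In-Cell step.

    xp     : particle positions in [0, length]
    charge : per-particle contribution omega_i * f_i to the density
    n_dop  : doping density at the m + 1 equally spaced grid points
    """
    h = length / m
    # 1) Assignment: linear weighting of each particle to its two neighbours.
    dens = [0.0] * (m + 1)
    for x, c in zip(xp, charge):
        j = min(int(x / h), m - 1)
        s = x / h - j
        dens[j] += c * (1.0 - s) / h
        dens[j + 1] += c * s / h
    dens[0] *= 2.0   # boundary nodes only own half a cell
    dens[m] *= 2.0
    # 2) Poisson: -phi'' = (q/eps)(n_D - n), tridiagonal (-1, 2, -1) system.
    n = m - 1
    rhs = [q_over_eps * (n_dop[j] - dens[j]) * h * h for j in range(1, m)]
    rhs[0] += phi0
    rhs[-1] += phi_l
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + cp[i - 1]
        cp[i] = -1.0 / denom
        dp[i] = (rhs[i] + dp[i - 1]) / denom
    phi = [phi0] + [0.0] * n + [phi_l]
    phi[n] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        phi[i + 1] = dp[i] - cp[i] * phi[i + 2]
    # 3) Field E = -dphi/dx: centred differences, one-sided at the ends.
    e_grid = [0.0] * (m + 1)
    for j in range(1, m):
        e_grid[j] = -(phi[j + 1] - phi[j - 1]) / (2.0 * h)
    e_grid[0] = -(phi[1] - phi[0]) / h
    e_grid[m] = -(phi[m] - phi[m - 1]) / h
    # 4) Interpolation of the grid field back to the particles.
    e_part = []
    for x in xp:
        j = min(int(x / h), m - 1)
        s = x / h - j
        e_part.append((1.0 - s) * e_grid[j] + s * e_grid[j + 1])
    return e_part

# Sanity check: without charge the potential is linear and E = -(phi_L - phi_0)/L.
e_vacuum = pic_field([0.3, 0.7], [0.0, 0.0], [0.0] * 11, 1.0, 10, 0.0, 2.0)
```

The mesh size h limits the resolution of the field peaks near the junctions, which is the motivation for the second, mesh-free method.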

The second method uses the Green's functions of Poisson's equation and relies on an exact
representation of the mutual Coulomb interaction between the particles. Indeed, an integration of (9)
with respect to the wave-vector gives a particle representation of the electronic density :

$$n_h(x,t) = \sum_{j=1}^{N} \omega_j\,f_j(t)\,\delta(x - x_j(t))$$

The "exact" electric field, solution of Poisson's equation (2,3) for this approximate density can be
computed at the location of the i-th particle by means of the Green's kernel K(x,y) of Poisson's
equation with periodic boundary conditions :

$$\frac{\partial K}{\partial x}(x,y) = \delta(x-y) - \frac{1}{L}, \qquad K(0,y) = K(L,y), \qquad \int_0^L K(x,y)\,dy = 0.$$

Therefore we let [5] :

$$E_i(t) = \frac{q}{\epsilon}\left[\int_0^L K(x_i(t),y)\,n_D(y)\,dy - \sum_{j=1}^{N} \omega_j\,f_j(t)\,K(x_i(t),x_j(t))\right] - \frac{\phi_L - \phi_0}{L}$$

where $\phi_0$ and $\phi_L$ are the prescribed boundary conditions on the potential at x=0 and x=L.
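
A possible realization of the Green's kernel evaluation is sketched below. It assumes that the kernel satisfies dK/dx = delta(x - y) - 1/L with K(0,y) = K(L,y) and zero mean, for which a closed form is K(x,y) = sign(x - y)/2 - (x - y)/L. The functions `kernel` and `green_field`, the midpoint quadrature of the doping integral, the convention K(x,x) = 0 and the reduced units with q/epsilon = 1 are assumptions of this example.

```python
def kernel(x, y, length):
    """Periodic, zero-mean kernel with a unit jump at x = y (sawtooth shape)."""
    d = x - y
    if d > 0.0:
        return 0.5 - d / length
    if d < 0.0:
        return -0.5 - d / length
    return 0.0  # a particle exerts no field on itself (convention assumed here)

def green_field(xp, charge, doping, length, phi0, phi_l,
                q_over_eps=1.0, n_quad=400):
    """Field at each particle from the doping and from all other particles.

    xp     : particle positions
    charge : per-particle contribution omega_j * f_j to the density
    doping : function y -> n_D(y), integrated by the midpoint rule
    """
    h = length / n_quad
    nodes = [(m + 0.5) * h for m in range(n_quad)]
    field = []
    for xi in xp:
        dop = sum(kernel(xi, y, length) * doping(y) for y in nodes) * h
        part = sum(c * kernel(xi, xj, length) for xj, c in zip(xp, charge))
        field.append(q_over_eps * (dop - part) - (phi_l - phi0) / length)
    return field

# Properties of the kernel: zero mean over the period and unit jump at x = y.
mean_k = sum(kernel((m + 0.5) * 0.001, 0.3, 1.0) for m in range(1000)) * 0.001
jump_k = kernel(0.3 + 1e-9, 0.3, 1.0) - kernel(0.3 - 1e-9, 0.3, 1.0)
e_vacuum = green_field([0.2, 0.6], [0.0, 0.0], lambda y: 0.0, 1.0, 0.0, 2.0)
```

The particle sum costs of the order of N squared operations per time step, but it involves no mesh and therefore no assignment or interpolation error.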


We present simulations of a GaAs N+-N-N+ structure at T = 300 K, and we compare our
results with those obtained by Baranger [9]. In our simulations, we used a two-valley model for
GaAs, and the physical description of the interactions as given by formula (8), while Baranger's
model involves a simplified 2-valley relaxation time model for the description of the collisions.

Figure 5 shows the electronic density profiles at t = 1 ps : the qualitative agreement with
Baranger's results is satisfactory, although our results are noisier. Indeed, the effect of including
a second valley is the same for both results : the total electronic density increases in the N-N+
junction area, where the $\Gamma$-L transfer occurs (i.e. where an important fraction of electrons belonging
to the $\Gamma$ valley reach high enough energies to transfer into the L valley by means of the collision
processes). The potential slightly decreases in the N region (Figure 6), and thus the field curvature is
changed near the N-N+ junction (Figure 7). The mean total current (which is a function of time only)
has an oscillatory transient regime, and then converges towards the value found by Baranger in [9]
(Figure 8). The total mean velocity is lower in a two-valley model than in a single-valley model, and it
reaches its maximum further from the N-N+ junction (Figure 9). Figure 10 presents a snapshot of
the distribution function in the L-valley at t = 1 ps as a function of position and energy, showing
that the $\Gamma$-L transfer clearly occurs at the N-N+ junction. We refer the reader to [9] for a detailed
discussion of the involved physical phenomena.


6.  CONCLUSION 

We  have  presented  a  new  numerical  method  for  the  Boltzmann  equation  of  semiconductors, 
based on a deterministic treatment of the collision operator which may provide an accurate description
of the physics of electron transport. It proved to be very satisfactory in the homogeneous model and
to provide useful information for more macroscopic models. In the inhomogeneous case, the
comparisons  show  that  the  method  is  reliable  and  is  able  to  give  an  accurate  picture  of  the  physical 
phenomena occurring in submicron structures.


Figure 1 : Top : total mean velocity (x $10^7$ cm/s)
Middle : total mean energy (eV)
Bottom : Electron population in each valley (%)
as a function of the time (ps). The left curves are obtained by the deterministic
method ; the right ones by a Monte-Carlo method [17].


Figure 2 : Distribution function in the $\Gamma$ valley as a function of the parallel ($k_1$)
and perpendicular ($k_2$) components of the wave-vector k. The six graphs
correspond to the times (from left to right and from top to bottom) : t = 0, 0.5,
1, 1.5, 2, 2.5 ps.


Figure 5 : Left : $\Gamma$ valley, L valley and total electron densities (cm$^{-3}$, log
scale) at the stationary state
Right : total electron densities (cm$^{-3}$, log scale) at stationary state
for a 2 valley model compared with a 1 valley model
as a function of the distance in the structure ($\mu$m). The top curves are obtained
by the deterministic method ; the bottom ones by Baranger's model [9].


Figure 6 : Electric potential (V) at stationary state, as a function of the distance
in the structure ($\mu$m). Comparison between a 2 valley and a 1 valley model.
The left curves are obtained by the deterministic method ; the right ones by
Baranger's model [9].


Figure 7 : Electric field (kV/cm) at the stationary state, as a function of the
distance in the structure ($\mu$m). Comparison between a 2 valley and a 1
valley model.


Figure 8 : Total current (x $10^4$ A/cm$^2$) as a function of the time (ps).
solid line : results by the deterministic method
dotted line : Baranger's value of the stationary current [9].


Figure 9 : Left : $\Gamma$ valley, L valley and total mean velocities (x $10^7$ cm/s) at
the stationary state
Right : total mean velocities (x $10^7$ cm/s) at the stationary state for
a 2 valley model compared with a 1 valley model
as a function of the distance in the structure ($\mu$m). The top curves are obtained
by the deterministic method ; the bottom ones by Baranger's model [9].


Figure 10 : Distribution function in the L valley as a function of the distance in
the structure ($\mu$m ; increasing from right to left) and of the energy (eV).


NUMERICAL SIMULATION OF RAREFIED GAS FLOWS


Hans  Babovsky 

AGTM, Universität Kaiserslautern


This talk is concerned with the progress of the Arbeitsgruppe
Technomathematik (AGTM) in developing a simulation code for
rarefied gas flows and in performing 2d and 3d calculations for
realistic situations. This work has been done in connection with
a project in the R & D program for the European space shuttle
Hermes.


1. Theoretical aspects

The reference point in the development of our simulation code is
an appropriate Boltzmann equation for the description of rarefied
gas flows. Our intention is to yield approximate solutions
to this equation.


The Boltzmann equation for a density function f = f(t,x,v) in
phase space has the form

$$Df = J(f,f)$$

with the free streaming operator

$$Df = (\partial_t + v\cdot\nabla_x)\,f$$

and the collision integral J(f,f). The idea underlying our
scheme is to approximate f by a system of N moving particles
$(x_i(t), v_i(t))_{i \le N}$. Precisely, the problem is to invent an (artificial)
"dynamics" for the N-point system in such a way that the
simulated solution keeps close to the exact solution of the
Boltzmann equation.

This dynamics is constructed by decoupling free flow and collisions.
While the first step is performed easily by a translation
of the x-coordinates:

$$(x_i, v_i) \to (x_i + \Delta t\cdot v_i,\ v_i),$$

the second step imitating collisions through discontinuous
changes of velocities is crucial for an efficient simulation
scheme. In our code, this is done by choosing triples $(v_i, w_i, b_i)$
consisting of velocity pairs $(v_i, w_i)$ and of (collision) parameters
$b_i$ and by applying an appropriate transformation V to
calculate the new velocities $(v_i^{new}, w_i^{new})$ (compare [1]):

$$v_i^{new} = V(v_i, w_i, b_i),$$

$$w_i^{new} = V(w_i, v_i, b_i).$$
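
The two decoupled steps can be illustrated by a minimal sketch. The collision rule used here is the elastic hard-sphere type exchange along a unit vector n, with n playing the role of the collision parameter; this choice, like the names `free_flow`, `collide` and `random_unit_vector`, is an assumption of the example and not necessarily the transformation used in our code.

```python
import math
import random

def free_flow(x, v, dt):
    """Free streaming step: (x, v) -> (x + dt * v, v)."""
    return [xi + dt * vi for xi, vi in zip(x, v)], list(v)

def random_unit_vector(rng):
    """Unit vector uniformly distributed on the sphere."""
    z = 2.0 * rng.random() - 1.0
    phi = 2.0 * math.pi * rng.random()
    r = math.sqrt(1.0 - z * z)
    return [r * math.cos(phi), r * math.sin(phi), z]

def collide(v, w, n):
    """Elastic collision of the velocity pair (v, w) along the unit vector n.

    The component of the relative velocity along n is exchanged, which
    conserves both momentum and kinetic energy of the pair.
    """
    rel = sum((vi - wi) * ni for vi, wi, ni in zip(v, w, n))
    v_new = [vi - rel * ni for vi, ni in zip(v, n)]
    w_new = [wi + rel * ni for wi, ni in zip(w, n)]
    return v_new, w_new

rng = random.Random(0)
v, w = [1.0, 0.5, -2.0], [-0.3, 2.0, 0.7]
v2, w2 = collide(v, w, random_unit_vector(rng))
x_new, _ = free_flow([0.0, 0.0, 0.0], v2, 0.1)

momentum_error = max(abs(a + b - c - d) for a, b, c, d in zip(v2, w2, v, w))
energy_before = sum(a * a for a in v) + sum(b * b for b in w)
energy_after = sum(a * a for a in v2) + sum(b * b for b in w2)
```

How the pairs and the parameters are selected, at random or from a low discrepancy sequence, is what distinguishes the different versions of the scheme.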


In  order  to  be  consistent  with  the  Boltzmann  equation,  choosing 
the  triples  requires  certain  conditions  to  be  satisfied  (see 
[2]).  There  are  several  possibilities:  On  one  hand  the  purely 
random  game  called  "Monte  Carlo  version",  on  the  other  hand 
so-called "Low Discrepancy versions" reducing fluctuations. (The
code currently in use employs random numbers as well as Low
Discrepancy sequences. A detailed description of this code will
be  provided  in  [4].)  The  use  of  these  schemes  is  justified  by  a 
Law  of  Large  Numbers  stating  that  the  simulated  solutions  are 
good  approximations  to  the  exact  solutions  if  the  particle 
number  N  is  large  enough.  (A  proof  of  this  Law  of  Large  Numbers 
goes  along  the  lines  of  the  proof  in  [2],  [3].) 
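
To illustrate the difference between random and low discrepancy point sets, the sketch below estimates a one-dimensional integral with both. The van der Corput sequence in base 2 is used as a standard example of a low discrepancy sequence; it is an assumption of this illustration and not necessarily the sequence used in our code, and the integrand is a hypothetical test function.

```python
import random

def van_der_corput(n, base=2):
    """n-th element of the van der Corput sequence (radical inverse of n)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def estimate(points, g):
    """Equal-weight quadrature of g over the given points in [0, 1)."""
    return sum(g(u) for u in points) / len(points)

def g(u):
    return u * u  # test integrand, exact integral over [0, 1] is 1/3

n_points = 1024
rng = random.Random(1)
mc_points = [rng.random() for _ in range(n_points)]
ld_points = [van_der_corput(n) for n in range(n_points)]

mc_error = abs(estimate(mc_points, g) - 1.0 / 3.0)
ld_error = abs(estimate(ld_points, g) - 1.0 / 3.0)
```

The random estimate fluctuates with a standard deviation of order N^(-1/2), whereas the low discrepancy estimate has an error of order log(N)/N for smooth integrands, which is the reduction of fluctuations mentioned above.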


However,  in  practical  calculations  one  is  forced  to  use  particle 
systems  much  too  small  to  be  in  the  (theoretical)  domain  of 
validity  of  the  Law  of  Large  Numbers.  This  was  motivation  for  us 
to study the behaviour of small particle systems. Here, ergodic
theory yields strong mathematical results [5]. We could

- show that time averages of small particle systems do not represent
  solutions of the Boltzmann equation but are affected
  by a systematic error which is expected to be smaller
  the bigger the particle system is;

- for simple cases provide an exact formula for the systematic
  error which allows corrections to be applied to the simulated
  solutions;

- prove that boundary conditions - for example at artificial
  boundaries limiting the region of calculation - can have a
  strong influence on the systematic error; thus, well-performing
  boundary conditions at non-physical boundaries may have a
  strong impact on the validity of the scheme.

2. Aspects of modelling

Aspects of modelling play an important role when simulating gas
flows in realistic situations. Aspects of particular interest
are

- boundary conditions
- interior energies
- gas mixtures
- chemical reactions.

Boundary  conditions 

They  have  to  be  modelled  in  such  a  way  that  coefficients  of 
interest  are  obtained  correctly.  For  flows  around  bodies,  of 
particular  interest  are  for  example  drag,  shear  stress  and  lift 
coefficients.  As  experiments  show,  several  of  these  depend  in  a 
very  sensitive  way  on  the  type  of  boundary  conditions  used.  As  a 
consequence,  a  simulation  code  has  to  be  flexible  enough  to 
include  various  classes  of  boundary  conditions  in  order  to 
properly  model  the  situation  of  interest. 

We have compared different classes of boundary conditions
including

- specular reflection
- diffuse reflection with accommodation coefficient
- the Cercignani-Lampis model.

To this end we have performed 3d calculations of flows around
flat discs with different angles of attack $\alpha$, and have compared
these results with experimental results obtained by Legge [6].
Some data for Argon are presented in figures 1 and 2 where we
compare the pressure drag coefficient versus the accommodation
coefficient and versus the Knudsen number. In figure 2, the
experimental data are represented by the points, while the lines
show the simulated results. Similar calculations have been done
for nitrogen.
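
The first two classes of boundary conditions can be sketched as follows. The sketch assumes a Maxwell-type rule in which a particle is re-emitted diffusely with probability equal to the accommodation coefficient and reflected specularly otherwise; this rule, the function names and the parameter `rt_wall` (gas constant times wall temperature) are assumptions of the example. The Cercignani-Lampis model is not covered here.

```python
import math
import random

def specular(v, n):
    """Specular reflection at a wall with unit normal n pointing into the gas."""
    vn = sum(a * b for a, b in zip(v, n))
    return [a - 2.0 * vn * b for a, b in zip(v, n)]

def diffuse(n, t1, t2, rt_wall, rng):
    """Diffuse re-emission from a wall Maxwellian; (n, t1, t2) is orthonormal."""
    s = math.sqrt(rt_wall)
    # Flux-weighted normal component (Rayleigh law), Gaussian tangential ones.
    vn = s * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
    a, b = rng.gauss(0.0, s), rng.gauss(0.0, s)
    return [vn * n[i] + a * t1[i] + b * t2[i] for i in range(3)]

def maxwell_boundary(v, n, t1, t2, rt_wall, alpha, rng):
    """Diffuse with probability alpha (accommodation coefficient), else specular."""
    if rng.random() < alpha:
        return diffuse(n, t1, t2, rt_wall, rng)
    return specular(v, n)

rng = random.Random(3)
normal, tan1, tan2 = [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
incoming = [1.0, 2.0, -3.0]
reflected = specular(incoming, normal)
emitted = [diffuse(normal, tan1, tan2, 1.0, rng) for _ in range(100)]
pure_specular = maxwell_boundary(incoming, normal, tan1, tan2, 1.0, 0.0, rng)
```

Varying `alpha` between 0 and 1 interpolates between the two limiting behaviours, which is the kind of dependence compared with the experimental drag data.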

Interior  energies 

Here, we are mostly interested in imitating rotational energies,
which becomes necessary when simulating polyatomic gases. The
most frequently used model on the market is the Larsen-Borgnakke
model. We have analyzed its mathematical structure and extended
it to a larger class of models. Furthermore, we have implemented
Kuscer's VHS model which from a physical point of view seems
much more reliable than the models described above. We have
also considered certain geometrical models like loaded sphere
models.

Investigations of shock structures have shown significant
dependence of the results on the model. (Results are described
in papers to be published soon.) However, in realistic situations
the choice of the most appropriate model may fail due to
the lack of sufficient experimental data.


Gas  mixtures 

Binary  gas  mixtures  with  approximately  equal  densities  are 
readily  included  in  our  code.  Big  differences  of  densities, 
however,  required  the  introduction  of  weighted  particles  which 
forced  us  to  slightly  change  the  collision  model. 

This  has  now  also  been  successfully  implemented.  As  an  example, 
figure  3  shows  shock  profiles  in  a  two  component  gas. 

Chemical  reactions 

For  the  modelling  of  chemical  reactions  much  theoretical  work  is 
still  necessary.  There  are  several  models  on  the  market  but  none 
of  them  is  very  satisfactory  from  a  theoretical  point  of  view. 
From a practical point of view, further numerical investigations
and experiments are necessary to test the relevance of these
models.

3. Computational aspects

Presently,  we  are  mainly  concerned  with  the  simulation  of  2d  and 
3d  calculations  of  flows  around  bodies  of  different  shapes.  For 
example,  we  have  performed  2d  calculations  around  double 
ellipses  and  3d  calculations  around  flat  discs  with  different 
angles  of  attack  at  Mach  15.6  (mainly  in  order  to  compare 
boundary  models  with  experimental  data)  and  around  a  delta  wing 
with Knudsen numbers down to 0.01.

For our calculations, we used a rectangular grid structure
(which makes it easy to locate particles again after free flow) with
an adaptive grid refinement. This cell system turned out to be
efficient and to reduce the computational effort quite well.
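
The advantage of a rectangular grid for locating particles can be seen in a short sketch: the cell containing a particle follows from one division per coordinate. The function `cell_index` and the uniform cell size are assumptions of this example; the adaptive refinement of the actual grid is not covered.

```python
import math

def cell_index(x, origin, h, shape):
    """Index of the rectangular cell containing the point x.

    x, origin : coordinates of the point and of the lower grid corner
    h         : cell size (uniform in this sketch)
    shape     : number of cells in each direction
    Returns a tuple of integers, or None if x lies outside the grid.
    """
    idx = []
    for xi, oi, ni in zip(x, origin, shape):
        j = int(math.floor((xi - oi) / h))
        if j < 0 or j >= ni:
            return None  # the particle has left the computational domain
        idx.append(j)
    return tuple(idx)

inside = cell_index((0.25, 0.75), (0.0, 0.0), 0.1, (10, 10))
outside = cell_index((1.5, 0.2), (0.0, 0.0), 0.1, (10, 10))
```

The cost per particle is independent of the number of cells, so sorting the particles into cells after each free flow step takes a time proportional to the number of particles.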

The complete code (including all the features described above)
has been proven to be approximately five times faster than
other simulation methods. For further details, see [4].

All calculations have been performed on the VP 100 in Kaiserslautern
and the VP 400 in Karlsruhe.


THE PERIODIC BOLTZMANN SEMICONDUCTOR EQUATION.


J.F. Bourgat 1, R. Glowinski 2, P. Le Tallec 3, J.F. Palmier 4


Extended abstract

This paper is concerned with the numerical solution of the Boltzmann semiconductor
equation. In the case of a problem homogeneous in space, the electronic function is solution
of the integro-differential equation

$$\frac{\partial f}{\partial t} - E\cdot\nabla_k f = Q(f) \quad \text{in } \mathbb{R}^3, \qquad \int_{\mathbb{R}^3} f(k)\,dk = 1.$$

Here, Q represents the linear scattering operator given by

$$Q(f) = \int_{\mathbb{R}^3} \bigl( s(k',k)\,f(k') - s(k,k')\,f(k) \bigr)\,dk'.$$

Our  main  objective  is  the  development  of  a  fast  deterministic  solution  procedure  for  the 
numerical  computation  of  the  steady  state  solutions  of  the  above  problem. 

In  this  framework,  our  discretization  strategy  is  based  on  an  upwind  finite  difference 
approximation  of  the  gradient  terms  and  on  a  deterministic  conservative  calculation  of 
the  integral  operators.  The  resulting  algebraic  system  is  then  solved  by  a  least  squares 
methodology  which  reduces  the  system  to  a  quadratic  minimization  problem  to  be  solved 
by  a  standard  conjugate  gradient  algorithm. 
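
The least squares strategy can be illustrated on a small dense system: minimizing the squared residual of A x = b leads to the normal equations, which a conjugate gradient iteration solves without forming the product of A with its transpose. The sketch below is a generic illustration; the function names and the 3 x 2 test system are assumptions of this example and do not come from the discretized Boltzmann operator.

```python
def matvec(a, x):
    """Product A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def rmatvec(a, y):
    """Product A^T y."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def least_squares_cg(a, b, tol=1e-12, max_iter=100):
    """Minimize |A x - b|^2 by conjugate gradients on the normal equations."""
    x = [0.0] * len(a[0])
    r = list(b)                 # residual b - A x for the initial guess x = 0
    z = rmatvec(a, r)           # gradient direction A^T r
    p = list(z)
    zz = dot(z, z)
    for _ in range(max_iter):
        if zz ** 0.5 < tol:
            break
        q = matvec(a, p)
        alpha = zz / dot(q, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = rmatvec(a, r)
        zz_new = dot(z, z)
        p = [zi + (zz_new / zz) * pi for zi, pi in zip(z, p)]
        zz = zz_new
    return x

# Hypothetical overdetermined system; the normal equations give x = (4/3, 7/3).
solution = least_squares_cg([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 4.0])
```

For the large sparse systems produced by the upwind discretization, only the two matrix-vector products would need to be adapted to the sparse storage.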

In the axisymmetric steady case problem with periodic boundary conditions, which
models superlattices, this methodology gives nice results for all values of the electric fields.
These results are confirmed by an unsteady analysis which, although more expensive numerically,
gives additional information on the relaxation behavior of the distribution function
towards its steady state.


VARIOUS VELOCITY-PRESSURE ELEMENTS*

Department of Aerospace Engineering and Mechanics

Abstract

A comparative investigation, based on a series of numerical tests, of various
velocity-pressure  elements  used  for  incompressible  flow  computations  is  presented.  These 
elements  are  implemented  in  conjunction  with  the  one-step  and  multi-step  temporal 
integration  of  unsteady  Navier-Stokes  equations.  The  group  of  elements  studied  includes 
the  element  with  a  Petrov-Galerkin  stabilization  that  allows  equal-order  (bilinear) 
interpolation  functions  for  velocity  and  pressure.  The  test  cases  chosen  are  the  standing 
vortex  problem,  the  lid-driven  cavity  flow,  and  flow  past  a  circular  cylinder. 

1.  Introduction 

In  this  paper  we  conduct  a  comparative  study  of  various  finite  elements  used  for 
incompressible  flow  computations  based  on  the  velocity-pressure  formulation  of  unsteady 
Navier-Stokes equations. The elements covered by this study are Q1P0 (bilinear velocity/
discontinuous piecewise constant pressure), Q2Q1 (biquadratic velocity/bilinear
pressure), pQ2Q1 ("pseudo" biquadratic velocity/bilinear pressure), Q2P1# (biquadratic
velocity/discontinuous piecewise bilinear pressure), Q2P1A (biquadratic velocity/
discontinuous piecewise linear pressure), pQ2P1A ("pseudo" biquadratic velocity/
discontinuous piecewise linear pressure), and Q1Q1/ε (bilinear velocity/bilinear pressure
with Petrov-Galerkin stabilizer). We implemented these elements by generalizing the
one-step formulation developed by Brooks and Hughes [1] for the Q1P0 element, and also by
using the multi-step formulations presented in Tezduyar, Liou, and Ganjoo [2]. In all these
formulations  we  use  the  streamline-upwind/Petrov-Galerkin  (SUPG)  method  [1,2]  to 
prevent  the  spurious  oscillations  that  might  appear  in  the  presence  of  dominant  advective 
terms. 


In  the  one-step  formulation  the  SUPG  supplement  to  the  weighting  function  is 
applied  to  all  the  terms  in  the  momentum  equation.  For  the  element  with  equal-order 
(bilinear)  interpolation  functions  for  velocity  and  pressure,  in  addition  to  the  SUPG 
supplement,  another  Petrov-Galerkin  supplement  is  added  to  the  weighting  function  to 
stabilize the element. We will refer to this supplement as the PSPG ("pressure-stabilizing"
Petrov-Galerkin)  supplement.  The  PSPG  supplement  is  defined  by  utilizing  the  ideas 
introduced  by  Hughes  and  Franca  [3].  The  Petrov-Galerkin  formulation  introduced  in  [3]  is 
capable  of  accommodating  arbitrary  orders  of  interpolation  for  the  (steady-state)  solution  of 
Stokes  problem. 


* This research was sponsored by NSF under grant MSM-8796352.




The  T6  formulation  [2]  is  an  extension  of  the  T3  formulation  [2].  The  T3 
formulation  is  a  three-step  method  and  starts  out  with  a  splitting  scheme  in  which  the 
pressure  and  the  viscous  terms  are  treated  implicitly  in  the  first  and  third  steps  while  the 
advective  terms  are  treated  implicitly  in  the  second  step.  This  type  of  splitting  is  a  special 
case of the kind found in the θ-scheme [4]. In the T6 formulation, each step of the T3
formulation  is  subdivided  into  two  sub-steps  to  isolate  the  advective  terms,  and  the  SUPG 
supplement  is  applied  only  to  the  sub-steps  involving  the  advective  terms.  A  PSPG 
supplement  is  added  to  the  weighting  function  in  the  "Stokes  sub-steps"  when  using  equal- 
order  (bilinear)  functions  for  velocity  and  pressure. 

We  consider  three  numerical  tests:  the  standing  vortex  problem  [5],  the  lid-driven 
cavity  flow  at  Re=400,  and  flow  past  a  circular  cylinder  at  Re=100.  The  purpose  of  the 
standing  vortex  problem  is  to  determine  the  level  of  numerical  dissipation  involved  in  a 
numerical  solution  technique.  The  lid-driven  cavity  problem  involves  singularities  in  the 
pressure  field  and,  therefore,  is  regarded  as  a  stringent  test  case.  The  cylinder  problem  has 
been  studied  by  several  researchers  in  the  past  (see  for  example  [6])  and  has  become  a 
benchmark problem [2].

2.  The  Governing  Equations 

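Let $\Omega$ and $(0,T)$ denote the spatial and temporal domains, with $\mathbf{x}$ and $t$
representing the coordinates associated with $\Omega$ and $(0,T)$. We consider the following
velocity-pressure formulation of the incompressible Navier-Stokes equations:

$$\rho\left(\frac{\partial\mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u}\right) - \nabla\cdot\boldsymbol{\sigma} = \mathbf{0} \quad \text{on } \Omega\times(0,T), \tag{1}$$

$$\nabla\cdot\mathbf{u} = 0 \quad \text{on } \Omega\times(0,T), \tag{2}$$

where $\rho$ and $\mathbf{u}$ are the density and velocity and $\boldsymbol{\sigma}$ is the stress tensor given as

$$\boldsymbol{\sigma} = -p\,\mathbf{I} + 2\mu\,\boldsymbol{\varepsilon}(\mathbf{u}) \tag{3}$$

with

$$\boldsymbol{\varepsilon}(\mathbf{u}) = \tfrac{1}{2}\left(\nabla\mathbf{u} + (\nabla\mathbf{u})^T\right). \tag{4}$$

Here $p$ and $\mu$ represent the pressure and viscosity while $\mathbf{I}$ denotes the identity tensor. Both
the Dirichlet and Neumann type boundary conditions are taken into account as shown
below:

$$\mathbf{u} = \mathbf{g} \quad \text{on } \Gamma_g, \tag{5}$$

$$\mathbf{n}\cdot\boldsymbol{\sigma} = \mathbf{h} \quad \text{on } \Gamma_h, \tag{6}$$

where $\Gamma_g$ and $\Gamma_h$ are complementary subsets of the boundary $\Gamma$.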

3. Spatial and Temporal Discretizations


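Let $\mathcal{E}$ denote the set of elements resulting from the finite element discretization of
the computational domain $\Omega$ into subdomains $\Omega^e$, $e = 1, 2, \ldots, n_{el}$, where $n_{el}$ is the number of
elements. We associate to $\mathcal{E}$ the finite dimensional spaces $H^{kh}$ and $H^{mh}$, where $k$ and $m$
represent the orders of the interpolation functions used. The trial and test function spaces
are given as

$$S_u^h = \{\, \mathbf{u}^h \mid \mathbf{u}^h \in (H^{kh})^{n_{sd}},\ \mathbf{u}^h = \mathbf{g}^h \text{ on } \Gamma_g \,\}, \tag{7}$$

$$V_u^h = \{\, \mathbf{w}^h \mid \mathbf{w}^h \in (H^{kh})^{n_{sd}},\ \mathbf{w}^h = \mathbf{0} \text{ on } \Gamma_g \,\}, \tag{8}$$

$$S_p^h = V_p^h = \{\, q \mid q \in H^{mh} \,\}, \tag{9}$$

where $n_{sd}$ is the number of space dimensions.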

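The one-step formulation employed in this work is essentially the same as the one
used in [1]: find $\mathbf{u}^h \in S_u^h$ and $p^h \in S_p^h$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\left(\frac{\partial\mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right)d\Omega + \int_\Omega \boldsymbol{\varepsilon}(\mathbf{w}^h):\boldsymbol{\sigma}^h\,d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\delta}^h\cdot\left[\rho\left(\frac{\partial\mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}^h\right]d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\epsilon}^h\cdot\left[\rho\left(\frac{\partial\mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}^h\right]d\Omega \\
&\quad + \int_\Omega q^h\,\nabla\cdot\mathbf{u}^h\,d\Omega = \int_{\Gamma_h}\mathbf{w}^h\cdot\mathbf{h}^h\,d\Gamma, \qquad \forall\,\mathbf{w}^h\in V_u^h,\ \forall\,q^h\in V_p^h.
\end{aligned} \tag{10}$$

Here $\boldsymbol{\delta}^h$ is the streamline-upwind/Petrov-Galerkin (SUPG) supplement to the weighting
function $\mathbf{w}^h$ [7], and $\boldsymbol{\epsilon}^h$ is another Petrov-Galerkin (PSPG) supplement to stabilize the
element against pressure oscillations [3, 7]. We define $\boldsymbol{\epsilon}^h$ as follows: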


$$\boldsymbol{\epsilon}^h = E\,\frac{1}{\rho}\,\nabla q^h, \tag{11}$$

$$E = \frac{\epsilon\, z_\epsilon\, h}{2\,\|\mathbf{u}^*\|}, \tag{12}$$

where $\epsilon$ is a free parameter (which we normally set to 1), $h$ is the element length, $\mathbf{u}^*$ is a
global velocity, and

$$z_\epsilon = \begin{cases} Re^*/3, & 0 \le Re^* \le 3, \\ 1, & Re^* > 3. \end{cases} \tag{13}$$

The "Reynolds number" is defined as

$$Re^* = \frac{\|\mathbf{u}^*\|\, h}{2\nu}, \tag{14}$$

where $\nu$ is the kinematic viscosity. For the limiting values of $Re^*$ the expression for $E$
takes the following forms:

$$E \to \frac{\epsilon\, h}{2\,\|\mathbf{u}^*\|} \quad \text{for } Re^* \to \infty, \tag{15a}$$

$$E \to \frac{\epsilon\, h^2}{12\,\nu} \quad \text{for } Re^* \to 0. \tag{15b}$$
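As an illustration only, equations (12)-(14) can be evaluated numerically to check the two limits (15a) and (15b). The sketch below is not taken from the paper: the function name `pspg_coefficient` and the sample values of `h`, `nu`, and the velocity norms are assumptions chosen for the check.

```python
import math


def pspg_coefficient(u_norm, h, nu, eps=1.0):
    """Coefficient E of equation (12) for a given ||u*||, element length h,
    kinematic viscosity nu, and free parameter eps (hypothetical helper)."""
    re_star = u_norm * h / (2.0 * nu)               # equation (14)
    z = re_star / 3.0 if re_star <= 3.0 else 1.0    # equation (13)
    return eps * z * h / (2.0 * u_norm)             # equation (12)


# Sample values (assumed, for illustration only).
h, nu = 0.05, 1.0e-2

# Large Re*: E should equal eps*h / (2*||u*||), equation (15a).
E_high = pspg_coefficient(u_norm=100.0, h=h, nu=nu)

# Small Re*: E should equal eps*h**2 / (12*nu), equation (15b).
E_low = pspg_coefficient(u_norm=1.0e-3, h=h, nu=nu)

print(E_high, h / (2.0 * 100.0))
print(E_low, h * h / (12.0 * nu))
```

In the low-Reynolds-number branch the velocity norm cancels, which is why (15b) depends only on the element length and the viscosity.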


The  semi-discrete  equations  corresponding  to  equation  (10)  can  be  written  as 
follows: 

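$$\tilde{\mathbf{M}}\,\mathbf{a} + \tilde{\mathbf{N}}(\mathbf{v}) + \tilde{\mathbf{K}}\,\mathbf{v} - \tilde{\mathbf{G}}\,\mathbf{p} = \tilde{\mathbf{F}}, \tag{16}$$

$$\mathbf{G}^T\mathbf{v} + \mathbf{M}_\epsilon\,\mathbf{a} + \mathbf{N}_\epsilon(\mathbf{v}) + \mathbf{K}_\epsilon\,\mathbf{v} - \mathbf{G}_\epsilon\,\mathbf{p} = \mathbf{E} + \mathbf{F}_\epsilon, \tag{17}$$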

where  v  is  the  vector  of  unknown  nodal  values  of  uh,  a  is  the  time  derivative  of  v,  and  p  is 

the  vector  of  nodal  values  of  ph.  The  matrices  M,  N ,  K,  and  G  are  derived,  respectively, 

from  the  time-dependent,  advective,  viscous,  and  pressure  terms.  The  vector  F  is  due  to 
the  Dirichlet  and  Neumann  type  boundary  conditions  (i.e.  the  g  and  h  terms  in  equations 
(5)  and  (6)),  whereas  the  vector  E  is  due  to  the  Dirichlet  type  boundary  condition.  All  the 
arrays  with  a  superposed  tilde  can  be  decomposed  into  their  Galerkin  and  SUPG  parts: 


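$$\tilde{\mathbf{M}} = \mathbf{M} + \mathbf{M}_\delta, \tag{18}$$

$$\tilde{\mathbf{N}} = \mathbf{N} + \mathbf{N}_\delta, \tag{19}$$

$$\tilde{\mathbf{K}} = \mathbf{K} + \mathbf{K}_\delta, \tag{20}$$

$$\tilde{\mathbf{G}} = \mathbf{G} + \mathbf{G}_\delta, \tag{21}$$

$$\tilde{\mathbf{F}} = \mathbf{F} + \mathbf{F}_\delta, \tag{22}$$

where the subscript $\delta$ identifies the SUPG contribution. Similarly the subscript $\epsilon$ identifies
the PSPG contribution.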

Remarks

1. The PSPG perturbation is employed only when using equal orders of interpolation
for  velocity  and  pressure. 

2. The equation systems (16) and (17) can be solved implicitly. When using equal
orders of interpolation the unknowns are ordered node by node, leading to a
reasonable bandwidth; this is defined as the consistent (C) system. However, when
using unequal orders of interpolation for velocity and pressure, we reorder the
unknowns in such a way that all unknown velocities appear before unknown
pressures; we define this as the consistent-reordered (CR) system.

3.  The  equation  systems  (16)  and  (17)  can  also  be  solved  by  treating  the  velocity 
explicitly  in  the  momentum  equation.  The  way  the  Q1P0  element  is  used  in  [2] 
leads to a symmetric coefficient matrix for pressure. The Q1Q1/ε element leads to
a  nonsymmetric  coefficient  matrix  for  pressure  due  to  the  presence  of  PSPG 
perturbation  terms.  However,  if  the  PSPG  terms  which  cause  such  nonsymmetry 
are  neglected,  then  the  coefficient  matrix  for  the  pressure  can  be  "symmetrized".  All 
explicit  one-step  computations  presented  in  this  paper  are  based  on  such  a 
symmetrization,  and  the  results  are  obtained  with  2  passes. 


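The T6 formulation [2] is described as follows (a hat denotes the intermediate velocity
produced by the first sub-step of each step):

find $\hat{\mathbf{u}}^h_{n+\theta} \in (S_u^h)_{n+\theta}$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\left(\frac{\hat{\mathbf{u}}^h_{n+\theta} - \mathbf{u}^h_n}{\theta\,\Delta t} + \mathbf{u}^h_n\cdot\nabla\mathbf{u}^h_n\right)d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\delta}^h\cdot\left[\rho\left(\frac{\hat{\mathbf{u}}^h_{n+\theta} - \mathbf{u}^h_n}{\theta\,\Delta t} + \mathbf{u}^h_n\cdot\nabla\mathbf{u}^h_n\right)\right]d\Omega = 0, \qquad \forall\,\mathbf{w}^h\in V_u^h;
\end{aligned} \tag{23}$$

find $\mathbf{u}^h_{n+\theta} \in (S_u^h)_{n+\theta}$ and $p^h_{n+\theta} \in S_p^h$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\,\frac{\mathbf{u}^h_{n+\theta} - \hat{\mathbf{u}}^h_{n+\theta}}{\theta\,\Delta t}\,d\Omega + \int_\Omega \boldsymbol{\varepsilon}(\mathbf{w}^h):\boldsymbol{\sigma}^h_{n+\theta}\,d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\epsilon}^h\cdot\left[\rho\,\frac{\mathbf{u}^h_{n+\theta} - \hat{\mathbf{u}}^h_{n+\theta}}{\theta\,\Delta t} - \nabla\cdot\boldsymbol{\sigma}^h_{n+\theta}\right]d\Omega \\
&\quad + \int_\Omega q^h\,\nabla\cdot\mathbf{u}^h_{n+\theta}\,d\Omega = \int_{\Gamma_h}\mathbf{w}^h\cdot\mathbf{h}^h_{n+\theta}\,d\Gamma, \qquad \forall\,\mathbf{w}^h\in V_u^h,\ \forall\,q^h\in V_p^h;
\end{aligned} \tag{24}$$

find $\hat{\mathbf{u}}^h_{n+1-\theta} \in (S_u^h)_{n+1-\theta}$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\,\frac{\hat{\mathbf{u}}^h_{n+1-\theta} - \mathbf{u}^h_{n+\theta}}{(1-2\theta)\,\Delta t}\,d\Omega \\
&\quad + \int_\Omega \boldsymbol{\varepsilon}(\mathbf{w}^h):\boldsymbol{\sigma}^h_{n+\theta}\,d\Omega = \int_{\Gamma_h}\mathbf{w}^h\cdot\mathbf{h}^h_{n+\theta}\,d\Gamma, \qquad \forall\,\mathbf{w}^h\in V_u^h;
\end{aligned} \tag{25}$$

find $\mathbf{u}^h_{n+1-\theta} \in (S_u^h)_{n+1-\theta}$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\left(\frac{\mathbf{u}^h_{n+1-\theta} - \hat{\mathbf{u}}^h_{n+1-\theta}}{(1-2\theta)\,\Delta t} + \mathbf{u}^h_{n+1-\theta}\cdot\nabla\mathbf{u}^h_{n+1-\theta}\right)d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\delta}^h\cdot\left[\rho\left(\frac{\mathbf{u}^h_{n+1-\theta} - \hat{\mathbf{u}}^h_{n+1-\theta}}{(1-2\theta)\,\Delta t} + \mathbf{u}^h_{n+1-\theta}\cdot\nabla\mathbf{u}^h_{n+1-\theta}\right)\right]d\Omega = 0, \qquad \forall\,\mathbf{w}^h\in V_u^h;
\end{aligned} \tag{26}$$

find $\hat{\mathbf{u}}^h_{n+1} \in (S_u^h)_{n+1}$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\left(\frac{\hat{\mathbf{u}}^h_{n+1} - \mathbf{u}^h_{n+1-\theta}}{\theta\,\Delta t} + \mathbf{u}^h_{n+1-\theta}\cdot\nabla\mathbf{u}^h_{n+1-\theta}\right)d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\delta}^h\cdot\left[\rho\left(\frac{\hat{\mathbf{u}}^h_{n+1} - \mathbf{u}^h_{n+1-\theta}}{\theta\,\Delta t} + \mathbf{u}^h_{n+1-\theta}\cdot\nabla\mathbf{u}^h_{n+1-\theta}\right)\right]d\Omega = 0, \qquad \forall\,\mathbf{w}^h\in V_u^h;
\end{aligned} \tag{27}$$

find $\mathbf{u}^h_{n+1} \in (S_u^h)_{n+1}$ and $p^h_{n+1} \in S_p^h$ such that

$$\begin{aligned}
&\int_\Omega \mathbf{w}^h\cdot\rho\,\frac{\mathbf{u}^h_{n+1} - \hat{\mathbf{u}}^h_{n+1}}{\theta\,\Delta t}\,d\Omega + \int_\Omega \boldsymbol{\varepsilon}(\mathbf{w}^h):\boldsymbol{\sigma}^h_{n+1}\,d\Omega \\
&\quad + \sum_{e=1}^{n_{el}}\int_{\Omega^e}\boldsymbol{\epsilon}^h\cdot\left[\rho\,\frac{\mathbf{u}^h_{n+1} - \hat{\mathbf{u}}^h_{n+1}}{\theta\,\Delta t} - \nabla\cdot\boldsymbol{\sigma}^h_{n+1}\right]d\Omega \\
&\quad + \int_\Omega q^h\,\nabla\cdot\mathbf{u}^h_{n+1}\,d\Omega = \int_{\Gamma_h}\mathbf{w}^h\cdot\mathbf{h}^h_{n+1}\,d\Gamma, \qquad \forall\,\mathbf{w}^h\in V_u^h,\ \forall\,q^h\in V_p^h.
\end{aligned} \tag{28}$$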

Remarks

4. The parameter θ is the one used in the θ-scheme [4]; we set it to 1/3.

5. Unlike the T6 algorithm described in [2], we add the SUPG supplement in all
stages involving the advection term, i.e. in equations (23), (26), and (27). The
PSPG supplement added in equations (24) and (28) is for the Q1Q1/ε element
only.

6. The matrix forms corresponding to equations (23), (25), (26), and (27) can be
solved implicitly or explicitly as described in [2]. The matrix forms of the two
"Stokes sub-steps", i.e. equations (24) and (28), are quite similar to the matrix
form of the one-step formulation; they can be solved implicitly or by treating the
velocity explicitly. Except for the Q1Q1/ε element, the results presented in this
paper are based on the explicit treatment of all sub-steps, and are obtained with
5-3-3-3-5-3 passes. For the Q1Q1/ε element the "Stokes sub-steps" are treated
implicitly, and the results are obtained with 5-1-3-3-5-1 passes.
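As a small bookkeeping check, the three steps of the splitting span fractions θ, 1-2θ, and θ of the time step, so with θ = 1/3 every step covers one third of Δt. The sketch below only illustrates this arithmetic; the helper name `t6_time_levels` is an assumption and the code does not implement the formulation itself.

```python
from fractions import Fraction


def t6_time_levels(theta=Fraction(1, 3)):
    """Return the fraction of the time step covered by each of the three
    steps, and the time levels n+theta, n+1-theta, n+1 as fractions of dt.
    Hypothetical helper used only to illustrate the bookkeeping."""
    steps = [theta, 1 - 2 * theta, theta]
    levels = [theta, 1 - theta, Fraction(1)]
    return steps, levels


steps, levels = t6_time_levels()
print(steps)        # fractions of dt covered by each step
print(sum(steps))   # the three steps together cover one full time step
print(levels)       # intermediate time levels reached after each step
```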

4.  The  Velocity-Pressure  Elements  Used 

The velocity-pressure elements used in this paper are shown in Figure 1.

Figure 1. The velocity-pressure elements used (• velocity node, o pressure node).

We now describe each element briefly.

Q1P0 This element employs bilinear interpolation for velocity and discontinuous
piecewise constant interpolation for pressure. It does not satisfy the Babuska-Brezzi
condition and is known to suffer from spurious pressure modes. Nevertheless it is a
popular element.

Q2Q1  This  is  another  popular  element;  it  employs  biquadratic  interpolation  for 

velocity  and  bilinear  interpolation  for  pressure. 

PQ2Q1  This  is  the  "pseudo"  version  of  the  Q2Q1  element  in  which  the  velocity  is 
piecewise  bilinear  over  each  sub-element.  In  Figure  1  these  sub-elements  are  denoted  by 
dashed  lines. 

Q2P1#  This  element  uses  biquadratic  interpolation  for  velocity  and  discontinuous 
piecewise  bilinear  interpolation  for  pressure.  It  does  not  satisfy  the  Babuska-Brezzi 
condition  and  is  known  to  produce  spurious  pressure  modes. 

Q2P1A This element employs biquadratic interpolation for velocity and discontinuous
piecewise linear interpolation for pressure.

PQ2P1A  This  is  the  "pseudo"  version  of  the  Q2P1A  element  in  which  the  velocity  is 
piecewise  bilinear  over  each  sub-element.  In  Figure  1  these  sub-elements  are  denoted  by 
dashed  lines. 

Q1Q1/ε This element employs bilinear interpolations for both velocity and
pressure. Although this element does not satisfy the Babuska-Brezzi condition, it can be
stabilized by adding a PSPG supplement to the regular weighting function.

5.  Numerical  Tests  and  Observations 

Although  for  all  elements  computations  were  performed  with  both  one-step  and  T6 
formulations, we show only selected results for the T6 formulation. For the purpose of
comparison,  in  each  problem  the  meshes  generated  by  using  different  elements  have  the 
same  distribution  of  velocity  nodes.  The  nodal  values  of  the  pressure,  stream  function,  and 
vorticity  are  obtained  by  least-squares  interpolation. 

The  standing  vortex  problem 

This  test  problem  was  suggested  to  us  by  Gresho  (see  [5]).  The  purpose  of  the  test 
is  to  get  an  indication  of  how  much  numerical  dissipation  a  formulation  introduces.  The 

flow is inviscid and is contained in a 1 x 1 box. The initial condition consists of an
axisymmetric velocity profile with zero radial velocity and with the circumferential
velocity given as

$$u_\theta = \begin{cases} 5r, & r < 0.2, \\ 2 - 5r, & 0.2 < r < 0.4, \\ 0, & r > 0.4. \end{cases}$$

Since this initial condition is also the exact steady-state solution, the numerical formulation
should preserve this "standing" vortex as accurately as possible. The finite element mesh is
uniform and contains 20 x 20 elements for the Q1P0 and Q1Q1/ε elements. For the higher-order
elements we use 10 x 10 elements. The time step is 0.05; based on a constant "element length" of
0.05 this results in a peak local Courant number of 1.0.
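The setup of this test can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes the vortex is centered in the box at (0.5, 0.5), samples the profile on the velocity nodes of a uniform grid, and measures kinetic energy with a simple nodal sum; the names `u_theta`, `standing_vortex`, and `kinetic_energy` are hypothetical.

```python
import numpy as np


def u_theta(r):
    """Circumferential velocity of the standing vortex as a function of radius."""
    r = np.asarray(r, dtype=float)
    return np.where(r < 0.2, 5.0 * r, np.where(r < 0.4, 2.0 - 5.0 * r, 0.0))


def standing_vortex(n=20):
    """Nodal velocities on a uniform (n+1) x (n+1) grid of the unit box.
    The vortex center (0.5, 0.5) is an assumption, not stated in the text."""
    x = np.linspace(0.0, 1.0, n + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    dx, dy = X - 0.5, Y - 0.5
    r = np.hypot(dx, dy)
    safe_r = np.where(r > 0.0, r, 1.0)   # avoid 0/0 at the vortex center
    ut = u_theta(r)
    u = -ut * dy / safe_r                # x-component of a pure rotation
    v = ut * dx / safe_r                 # y-component of a pure rotation
    return X, Y, u, v


def kinetic_energy(u, v, h):
    """Kinetic energy per unit density from a simple nodal sum (assumed quadrature)."""
    return 0.5 * np.sum(u**2 + v**2) * h * h


h, dt = 0.05, 0.05
X, Y, u, v = standing_vortex(20)
peak_speed = float(np.max(np.hypot(u, v)))
courant = peak_speed * dt / h            # peak local Courant number
energy_0 = kinetic_energy(u, v, h)
print(peak_speed, courant, energy_0)
```

The percentage of kinetic energy retained after a number of time steps would then be `100 * kinetic_energy(u_t, v_t, h) / energy_0`, where `u_t` and `v_t` are the computed nodal velocities at the later time.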

Some of the solutions obtained at t = 3 (i.e. after 60 time steps) are shown in
Figures 2-4. The table below shows, for various elements, the percentage of the vortex
kinetic energy retained after 60 time steps.


Element    One-step implicit    One-step explicit    T6 explicit
Q1P0       22.6%                22.6%                94.7%
Q2Q1       Unstable             Unstable             Unstable
pQ2Q1      Unstable             Unstable             Unstable
Q2P1#      Unstable             Unstable             Unstable
Q2P1A      84.5%                84.5%                92.7%
pQ2P1A     85.5%                85.4%                91.7%
Q1Q1/ε     93.7%                86.9%                88.2%

Clearly the T6 formulation is less dissipative than the one-step explicit formulation.
Although with the T6 formulation all elements seem to yield comparable levels of
dissipation, with the one-step formulation the Q1P0 element shows significantly higher
dissipation. Moreover we observe that for higher-order elements the difference in energy
dissipation between the one-step and T6 formulations is not as large as it is for the Q1P0
element. We also note that the solution obtained with the pQ2P1A element is very close to
the solution obtained with the Q2P1A element. We found that the Galerkin one-step
formulation is unstable for all elements except for the Q1P0 element, which is only
marginally stable even with an implicit (consistent-reordered, expensive) formulation.

The  lid-driven  cavity  flow 

In  this  problem  the  lid  of  the  cavity  has  unit  velocity;  based  on  this  velocity  and  the 
dimension of the cavity the Reynolds number is 400. We choose a uniform grid of 32 x 32
elements for the Q1P0 and Q1Q1/ε elements. A uniform grid of 16 x 16 elements is used for
the higher-order elements.

Some of the steady-state results are shown in Figures 5-7. The results obtained
with the higher-order elements (i.e. Q2Q1, pQ2Q1, Q2P1#, Q2P1A, and pQ2P1A) are
all in close agreement. Also, the results obtained with the Q1P0 element are close to those
obtained with the Q1Q1/ε element. The differences in the solutions obtained with the
higher- and lower-order elements can be attributed to the size of the leakage area near the
moving lid.

Flow  past  a  circular  cylinder 

In  this  problem  we  have  a  uniform  upstream  flow;  the  Reynolds  number  based  on 
the  cylinder  diameter  is  100.  The  different  meshes  employed  are  shown  in  Figure  8. 

Some of the steady-state results are shown in Figures 9-11. Except for the Q2P1#
element, which produces oscillations in the velocity field all around the cylinder, the results
obtained with the higher-order elements (i.e. Q2Q1, pQ2Q1, Q2P1A, and pQ2P1A) are
all in close agreement. For the Q1P0 and Q1Q1/ε elements we observe some small
differences in the pressure field; these differences become slightly more noticeable in the
upstream region. Unlike in the driven cavity problem, both higher- and lower-order
elements result in solutions which are in good agreement when one inspects the velocity
and the variables derived from the velocity. Pressure fields, on the other hand, exhibit
some very small differences. The drag coefficients obtained with these elements are 1.162
(Q1P0), 1.149 (Q2P1A), and 1.154 (Q1Q1/ε).

Figure 4. Solution of the standing vortex problem at t=3 with Q1Q1/ε/T6.

Figure 5. Driven cavity flow at Reynolds number 400: steady-state solution obtained with Q1P0/T6.

Figure 7. Driven cavity flow at Reynolds number 400: steady-state solution obtained with Q1Q1/ε/T6.

Figure 8. Meshes for the problem of flow past a circular cylinder
(5240 elements, 5350 velocity nodes, and 5240 pressure nodes for Q1P0 element,
1310 elements, 5350 velocity nodes, and 3930 pressure nodes for pQ2P1A element,
1310 elements, 5350 velocity nodes, and 3930 pressure nodes for Q2P1A element).

Figure 9. Flow past a circular cylinder at Reynolds number 100: steady-state solution obtained with
Q1P0/T6.

Figure 10. Flow past a circular cylinder at Reynolds number 100: steady-state solution obtained with
Q2P1A/T6.

Figure 11. Flow past a circular cylinder at Reynolds number 100: steady-state solution obtained with
Q1Q1/ε/T6.

Didier Badouel - Thierry Priol
IRISA/INRIA  -  Rennes 


Abstract 

The production of realistic computer-generated images requires a huge amount of computation
and a large memory capacity. The use of highly parallel machines allows this process to be
performed faster. Distributed memory parallel computers, such as hypercubes or transputer-based
machines, offer an attractive performance/cost ratio, provided that a suitable load balancing and
partition of the data domain can be found. This paper demonstrates that emulating
a shared memory on these computers seems to be the best way to parallelize algorithms like
ray tracing, which use large read-only databases with no obvious distribution. Results are given
that allow a comparison with a previous parallel ray tracing algorithm that we have implemented
on an iPSC/2.

1  Introduction 

The ray tracing algorithm, based on simple laws of optics, simulates illumination
effects such as shading, reflection, and refraction. In order to evaluate the color
of each pixel of an image, a geometric model is used to describe the objects in a scene and a
photometric model is used to define the behaviour of objects with respect to light sources. Each
light intensity contribution for a pixel is evaluated with two kinds of computation.

The geometric calculations evaluate the closest intersection between a ray and the objects in
the scene. Their number increases with the photometric complexity of the scene, i.e. with the
number of rays, and with the geometric complexity of the scene, i.e. with the number and the
shape of the objects. Several techniques have been proposed to minimize the number of ray/object
intersections. These solutions are based on what we call an object access structure, which allows
a fast search of objects along a ray path. They can be grouped into two approaches:

• creation of a tree of bounding volumes [24, 18],

• subdivision of the scene extents in an adaptive way [2, 12, 17] or a regular way [1, 6, 10, 11].

The photometric calculations, once the impact point has been determined, are used to evaluate the
light intensity contribution of a ray. Their number is proportional to the number of intersection
points. According to the photometric properties of the objects, new rays are shot from the
intersection point in order to take into account the contribution to the pixel intensity of the
neighboring objects [9, 14, 28]. In fact, if the object is transparent (respectively reflective) then
a new ray is shot in the refracted direction (respectively in the reflected direction).

The purpose of this paper is not to compare the different models used for the geometric and
the photometric calculations; a detailed discussion can be found in [23] or in [10]. The aim of
the paper is to focus on the amounts of computation and memory required,
whatever algorithmic choices are made. For our parallel ray tracing algorithm, we have
opted for:

•  a  polyhedral  description  for  the  objects. 

•  a  regular  grid  as  object  access  structure. 

•  the  Whitted  model  [28]  improved  by  the  Phong  model  [21]  for  the  photometric  evaluation. 

Computing realistic images requires several million rays and several hundred thousand
objects. For each ray, the closest intersection point with the scene must be computed. Thus,
it is the great number of ray/object intersections which makes ray tracing so expensive.
Despite the improvements for ray/scene intersections, the ray tracing algorithm is still too slow
on sequential computers. Moreover, recent research such as stochastic sampling [8, 19] and
sophisticated light models [7, 16, 25] requires more and more computation. New algorithmic
improvements could not substantially decrease the synthesis time. Therefore, for the past few years,
we have been studying the use of Distributed Memory Parallel Computers (DMPC), which are low-cost
supercomputers. A DMPC is a MIMD computer where each processor has a local memory used
to store its own code and data. All the processors of a DMPC are connected through a network.

For our experiments, we have used an iPSC/2. The iPSC/2 system consists of a cube
connected to a host processor. The cube houses all the nodes, connected through a hypercube
network topology. Each node consists of the Intel 80386 microprocessor supplied with an 80387
floating point co-processor and 4 Mbytes of local memory. It is equipped with the Direct Connect
Module (DCM) for high speed message routing between nodes. The performance of a 64-node
system is approximately 256 MIPS and 20 MFLOPS. Each node can support a vector extension board
with a peak performance of 20 MFLOPS per node. The system available at IRISA is configured with
64 nodes and no vector extension. The host processor contains the software development tools.
It is connected via a special link to node 0 of the cube. It performs compilation, program loading
and I/O operations with the hypercube. The iPSC/2 can be programmed in C or FORTRAN. A
communication library has been added to these languages to allow sending or receiving messages
between nodes.

The standard programming methodology consists in subdividing the problem to be solved
into a set of communicating tasks [15] and mapping them onto processors. Each node contains in its
local memory the code and the data for its processes, and all the processes on the cube and on
the host can communicate via the exchange of messages. The design of a parallel algorithm
requires special attention to correctness and efficiency, avoiding deadlocks and ensuring load
balancing. For data-driven problems such as ray tracing, whose dynamic behaviour is difficult
to model, the computation load balance is measured experimentally. The
performance of a parallel algorithm is commonly given in terms of speed-up and efficiency. The
speed-up is the ratio of the running time on one processor to the running time obtained with p
processors. This quantity represents, in fact, the number of processors effectively used during
the parallel execution of the algorithm. As for the efficiency, it is equal to the ratio of the
speed-up to the number of processors. It represents the average utilization of the processors. These
two quantities are enough to measure the computation load balancing of a parallel algorithm
but are not the only performance criteria. In particular, for problems using large databases, the
partition of the data domain must be considered.
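The two measures defined above can be written down directly. In the sketch below the function names and the timings are made up for illustration; they are not measurements from the paper.

```python
def speed_up(t_one, t_p):
    """Ratio of the running time on one processor to the running time on p processors."""
    return t_one / t_p


def efficiency(t_one, t_p, p):
    """Speed-up divided by the number of processors (average processor utilization)."""
    return speed_up(t_one, t_p) / p


# Hypothetical timings in seconds, chosen only to illustrate the definitions.
t_one, t_p, p = 640.0, 12.5, 64
print(speed_up(t_one, t_p))        # number of processors effectively used
print(efficiency(t_one, t_p, p))   # fraction of the p processors kept busy
```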

Before describing our parallel implementation of ray tracing, we first discuss which
computations and databases this algorithm requires, and how large they are.

2  Computations  involved  in  the  ray  tracing  algorithm 

The purpose of this section is to present the basic algorithms for the geometric computations
considered in our ray tracing. We only present the algorithms necessary to solve ray/scene
intersections, as they represent more than 80% of the synthesis time when ray tracing complex
scenes. In our implementation, we use both a regular grid and object
extents called slabs, defined by Kay and Kajiya [18]. For each ray, the grid provides a list of
polygons located near the ray path. Then, before computing the ray/polygon
intersection, a test is made using the slabs to avoid this computation when the ray passes outside
the polygon extents. The use of these two filters at the same time is justified by the fact that
the grid which minimizes the synthesis time is not the one which provides the smallest number
of polygons.

Ray  and  polygon  representations 

• The parametric representation of a ray is:

$$R(t) = O + D\,t \tag{1}$$

where $O$ is the origin of the ray, $D$ the direction of the ray, and $t$ the parameter of the
representation.

• A polygon is described by its vertices $V_i$ ($i \in \{0, \ldots, n-1\}$, $n > 2$). Let $x_i$, $y_i$ and $z_i$ be
the coordinates of the vertex $V_i$. The normal of the plane containing the polygon, $N$, is
computed with the cross product:

$$N = (V_1 - V_0) \times (V_2 - V_0)$$

For any point $P$ of the plane, $P \cdot N$ is constant. This constant value is computed by the
dot product $d = -V_0 \cdot N$. The implicit representation of the plane is calculated once and for
all, and stored in the polygon description:

$$N \cdot P + d = 0 \tag{2}$$
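The precomputation of the plane coefficients can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name `polygon_plane` and the sample square are assumptions.

```python
import numpy as np


def polygon_plane(vertices):
    """Return (N, d) such that N . P + d = 0 for every point P of the polygon's plane.
    N is the cross product (V1 - V0) x (V2 - V0); it is not normalized."""
    v = np.asarray(vertices, dtype=float)
    normal = np.cross(v[1] - v[0], v[2] - v[0])
    d = -np.dot(v[0], normal)
    return normal, d


# Sample polygon (assumed): a unit square lying in the plane z = 2.
square = [(0, 0, 2), (1, 0, 2), (1, 1, 2), (0, 1, 2)]
normal, d = polygon_plane(square)
residuals = [float(np.dot(normal, p) + d) for p in square]
print(normal, d, residuals)
```

Storing `normal` and `d` with the polygon means that each later ray test needs only two dot products to locate the intersection with the plane.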


Ray/polygon  intersection 

When ray tracing polygonal databases, we must define a fast algorithm to compute
ray/polygon intersections. A barycentric approach has been described in [27]; the following
algorithm is quite similar but faster. The goal of the algorithm is not only to determine whether a ray
goes through the polygon, but also to determine the coordinates of the intersection point
and the parameters that locate this point with respect to the polygon's vertices. These parameters
are used to compute the interpolated normal at this point, and can also be used to compute the
entry of a texture map.

The parameter $t$ corresponding to the intersection point between the ray
and the embedding plane of the polygon is obtained by substituting equation (1) into equation (2):

$$t = -\frac{d + N \cdot O}{N \cdot D} \tag{3}$$

Figure 1: Parametric representation of the point P.

If the polygon and the ray are parallel ($N \cdot D = 0$), or if the intersection is behind the origin of the ray
($t < 0$), or if a closer intersection has already been found ($t > t_{\min}$), the intersection is
rejected. If not, we must determine whether the intersection point is inside the polygon. This is done
using a parametric resolution. This solution is based on triangles. If a polygon has $n$ vertices
($n > 3$), it is viewed as a set of $n - 2$ triangles. The only constraint is that the polygon be convex.
The point $P$ (cf. Fig. 1) is given by:

$$\overrightarrow{V_0P} = \alpha\,\overrightarrow{V_0V_1} + \beta\,\overrightarrow{V_0V_2} \tag{4}$$

The point $P$ is inside the triangle $(V_0V_1V_2)$ if:

$$\alpha \ge 0, \quad \beta \ge 0 \quad \text{and} \quad \alpha + \beta \le 1$$

Computing $\alpha$ and $\beta$ requires solving a system of three equations with two unknowns,
which can be reduced to a system of two equations with two unknowns by working in one of
the planes perpendicular to the axes. In order to avoid the degenerate cases where the
projection of the polygon would be a segment, we choose the plane of 'biggest' projection (as in [27]), computing
the value $i_0$ representing the direction of the projection plane.

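$$i_0 = \begin{cases}
0 & \text{if } |N_x| = \max(|N_x|, |N_y|, |N_z|) \\
1 & \text{if } |N_y| = \max(|N_x|, |N_y|, |N_z|) \\
2 & \text{if } |N_z| = \max(|N_x|, |N_y|, |N_z|)
\end{cases}$$

Let $i_1$ and $i_2$ ($i_1, i_2 \in \{0, 1, 2\}$) be the indices different from $i_0$,
representing the two other directions, and let $(u, v)$ be the two-dimensional coordinates of the
vectors $\overrightarrow{V_0P}$, $\overrightarrow{V_0V_1}$ and $\overrightarrow{V_0V_2}$:

$$u_0 = P_{i_1} - V_{0,i_1}, \quad u_1 = V_{1,i_1} - V_{0,i_1}, \quad u_2 = V_{2,i_1} - V_{0,i_1}$$

$$v_0 = P_{i_2} - V_{0,i_2}, \quad v_1 = V_{1,i_2} - V_{0,i_2}, \quad v_2 = V_{2,i_2} - V_{0,i_2}$$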

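Then, the solutions are:

$$\alpha = \frac{u_0 v_2 - u_2 v_0}{u_1 v_2 - u_2 v_1}, \qquad \beta = \frac{u_1 v_0 - u_0 v_1}{u_1 v_2 - u_2 v_1}$$

The interpolated normal at the point $P$ is obtained by:

$$N_P = (1 - (\alpha + \beta)) \, N_0 + \alpha \, N_1 + \beta \, N_2$$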
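
The intersection test described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the argument order, and the assumption that the ray is $P = O + tD$ and the plane is $N \cdot P = d$ (which is what equation (3) implies) are choices made for this sketch.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as sequences."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def intersect_triangle(O, D, V0, V1, V2, N, d, min_t=float("inf")):
    """Return (t, alpha, beta) if the ray O + t*D hits triangle (V0, V1, V2), else None.

    N is the plane normal and d = N.V0 (assumed plane equation N.P = d).
    """
    nd = dot(N, D)
    if nd == 0.0:
        return None                       # ray parallel to the plane
    t = (d - dot(N, O)) / nd              # equation (3)
    if t < 0.0 or t > min_t:
        return None                       # behind the origin, or farther than a known hit
    P = [O[k] + t * D[k] for k in range(3)]

    # i0 is the axis with the largest |N| component; project onto the two other axes.
    i0 = max(range(3), key=lambda k: abs(N[k]))
    i1, i2 = [k for k in range(3) if k != i0]
    u0, v0 = P[i1] - V0[i1], P[i2] - V0[i2]
    u1, v1 = V1[i1] - V0[i1], V1[i2] - V0[i2]
    u2, v2 = V2[i1] - V0[i1], V2[i2] - V0[i2]

    det = u1 * v2 - u2 * v1
    if det == 0.0:
        return None                       # degenerate (zero-area) triangle
    alpha = (u0 * v2 - u2 * v0) / det     # Cramer's rule on the projected system
    beta = (u1 * v0 - u0 * v1) / det
    if alpha < 0.0 or beta < 0.0 or alpha + beta > 1.0:
        return None                       # P lies outside the triangle
    return t, alpha, beta


def interpolate_normal(alpha, beta, N0, N1, N2):
    """Interpolated (unnormalized) normal at P from the vertex normals."""
    w0 = 1.0 - (alpha + beta)
    return [w0 * N0[k] + alpha * N1[k] + beta * N2[k] for k in range(3)]
```

For a convex polygon with more than three vertices, the same function would be called on each triangle of the fan $(V_0, V_k, V_{k+1})$ until one of them reports a hit.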
Figure 2: Running through a 3D grid.

Using  a  regular  3D  grid 

A 3D grid is a discrete representation of the scene space: it subdivides this space into equal
parallelepipedic sub-volumes called voxels. The discrete path of a ray through the grid is thus
an ordered set of voxels. Each voxel of the grid contains a list of identifiers of the polygons
passing through its volume. During the synthesis task, the polygons tested against a given ray
are those belonging to the voxels encountered along the path of the ray. To avoid intersecting
the same polygon several times with a given ray, an identifier ($Ray_{id}$) is stored in the
polygon description and represents the last ray compared with this polygon.

The method chosen to run through the grid is the same as the one described in [1]. Beforehand,
the first voxel encountered by the ray is computed. This voxel is either the voxel containing
the origin of the ray ($O$) or the entry voxel when the ray comes from outside the grid. For
each ray, the following values are initialized:

• the constants $\delta t_x$, $\delta t_y$ and $\delta t_z$ represent the increment of $t$ in
each direction $x$, $y$, $z$.

• the variables $t_x$, $t_y$ and $t_z$ represent the values of $t$ at the next voxel boundary
when crossing in the direction $x$, $y$ or $z$ (cf. Fig. 2).

The incremental traversal of the grid is then easy to compute. At each step, the comparison
between $t_x$, $t_y$ and $t_z$ gives the direction in which the next voxel is located. When a
step is made in the direction $i$, the variable $t_i$ is incremented by the constant value
$\delta t_i$.
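
As an illustration, the incremental traversal can be sketched in Python as below. The function name `traverse_grid`, its parameters, and the restriction to a ray whose origin already lies inside the grid are assumptions of this sketch; the computation of the entry voxel for a ray starting outside the grid is omitted.

```python
import math


def traverse_grid(O, D, grid_min, voxel_size, dims):
    """Yield the voxel indices (ix, iy, iz) visited by the ray O + t*D, in order.

    Assumes O lies inside the grid whose lower corner is grid_min and which
    holds dims[k] voxels of size voxel_size[k] along axis k.
    """
    idx, step, t_next, dt = [], [], [], []
    for k in range(3):
        i = int(math.floor((O[k] - grid_min[k]) / voxel_size[k]))
        i = min(max(i, 0), dims[k] - 1)          # clamp origins lying on the upper face
        idx.append(i)
        if D[k] > 0:
            step.append(1)
            boundary = grid_min[k] + (i + 1) * voxel_size[k]
            t_next.append((boundary - O[k]) / D[k])   # t at the next boundary (t_x, t_y, t_z)
            dt.append(voxel_size[k] / D[k])           # constant increment (delta t)
        elif D[k] < 0:
            step.append(-1)
            boundary = grid_min[k] + i * voxel_size[k]
            t_next.append((boundary - O[k]) / D[k])
            dt.append(-voxel_size[k] / D[k])
        else:
            step.append(0)                            # never crosses a boundary on this axis
            t_next.append(math.inf)
            dt.append(math.inf)

    while all(0 <= idx[k] < dims[k] for k in range(3)):
        yield tuple(idx)
        k = min(range(3), key=lambda a: t_next[a])    # axis of the nearest boundary
        if t_next[k] == math.inf:
            return                                    # null direction vector: nothing to traverse
        idx[k] += step[k]
        t_next[k] += dt[k]
```

Only one comparison and one addition per step are needed once the per-ray constants have been initialized, which is what makes the traversal inexpensive.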

Using  Slabs 

The slabs (cf. [18]) are convex extents delimited by pairs of parallel planes (see figure 3 for
a 2D example). A slab is characterized by a normal direction $N_i$ and two values $d_i^{min}$
and $d_i^{max}$ such that the equations of the planes bounding a polygon in the direction $N_i$
are:

$$N_i \cdot P - d_i^{min} = 0 \quad \text{and} \quad N_i \cdot P - d_i^{max} = 0 \qquad (5)$$

The values $d_i^{min}$ and $d_i^{max}$ are evaluated by projecting each vertex $V_j$ onto the
line of direction $N_i$:

$$d_{ij} = N_i \cdot V_j, \qquad d_i^{min} = \min_j(d_{ij}), \qquad d_i^{max} = \max_j(d_{ij})$$

The values $d_i^{min}$ and $d_i^{max}$ are stored in the polygon description. During the
synthesis task, the ray/slab intersection gives a segment of the ray such that
$t_i^{min} \leq t \leq t_i^{max}$. These values are computed using the ray representation
(Equ. 1) and the slab representation (Equ. 5):

$$t_i^{min} = \frac{d_i^{min} - N_i \cdot O}{N_i \cdot D}, \qquad t_i^{max} = \frac{d_i^{max} - N_i \cdot O}{N_i \cdot D}$$

Figure 3: Description and intersection of the slab extents.

Choosing different slab directions for each polygon would be possible. However, when the slab
directions are the same for all the polygons, some improvements can be made. The idea (proposed
in [18]) is to pre-compute, for each ray, the following values:

$$S_i = N_i \cdot O \quad \text{and} \quad T_i = \frac{1}{N_i \cdot D}$$

The ray/slab intersection then only requires the following computations:

$$t_i^{min} = (d_i^{min} - S_i) \, T_i \quad \text{and} \quad t_i^{max} = (d_i^{max} - S_i) \, T_i$$

If $\bigcap_i [t_i^{min}, t_i^{max}]$ is an empty segment, we can conclude that the computation
of the intersection between the ray and the object can be skipped.
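
The rejection test can be sketched in Python as follows. The function names are hypothetical, and two details go beyond what the text states: a ray parallel to a slab ($N_i \cdot D = 0$) is handled by a containment test, and the two bounds are swapped when $T_i$ is negative. Both are additions made so that the sketch is correct for arbitrary rays.

```python
import math


def dot(a, b):
    """Dot product of two 3D vectors given as sequences."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def slab_bounds(vertices, normals):
    """Per-polygon preprocessing: d_min[i], d_max[i] over the vertices for each direction N_i."""
    d_min = [min(dot(N, V) for V in vertices) for N in normals]
    d_max = [max(dot(N, V) for V in vertices) for N in normals]
    return d_min, d_max


def ray_terms(O, D, normals):
    """Per-ray preprocessing: S_i = N_i.O and T_i = 1 / (N_i.D), or None when N_i.D = 0."""
    S = [dot(N, O) for N in normals]
    T = [1.0 / dot(N, D) if dot(N, D) != 0 else None for N in normals]
    return S, T


def ray_hits_extent(S, T, d_min, d_max):
    """Return False when the intersection of the slab segments is empty (object can be skipped)."""
    t_near, t_far = 0.0, math.inf
    for s, t_inv, lo, hi in zip(S, T, d_min, d_max):
        if t_inv is None:                 # ray parallel to this slab (added case)
            if not (lo <= s <= hi):
                return False
            continue
        t0, t1 = (lo - s) * t_inv, (hi - s) * t_inv
        if t0 > t1:                       # negative T_i reverses the order (added case)
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True
```

Since `ray_terms` is evaluated once per ray and `slab_bounds` once per polygon, each ray/extent test costs one subtraction and one multiplication per bounding plane.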

Partitioning  the  ray  tracing  algorithm 

The geometrical databases involved in these algorithms are the polygon descriptions and the
grid description with its associated voxels. The size of these databases rapidly reaches several
tens of megabytes (Mbytes). For example, in our results, we present a database which required
140 Mbytes of memory. The memory problem becomes even more critical when texture databases are
used. Our study of parallelisation therefore did not retain the algorithms based on processing
without dataflow, as they do not achieve a data distribution that allows complex scenes to be
rendered.

Since the computation of each pixel is independent of the others, the computation can easily be
distributed among the processors. As there are many more pixels than processors, load balancing
can be achieved by using a server/client programming model: a server process assigns the
computation of a pixel to a client process running on an idle processor.

Thus, the problem in parallelising such an algorithm is to ensure a good partition of both the
database and the computation. Our first experience with parallel ray tracing (see [22, 23]) was
based on processing with ray dataflow. This algorithm took up Cleary's idea [5] and subdivides
the ray tracing problem into sub-regions distributed among the different processor elements
(PEs). The rays are exchanged between processors when they leave one region for the next one.
This experience yielded several interesting solutions to ensure static load balancing, but the
efficiency of the algorithm decreases as the number of processors increases. The main reason
for this behaviour is that the number of messages grows with the number of PEs, and that these
messages are not uniformly distributed among the PEs.

Figure 4: Several uses of the memory cache mechanism.

In conclusion, finding a good computation load balance for the ray tracing algorithm, while
respecting constraints such as the size of each local memory and the communication rate
between PEs, is very complex with a message passing programming model. We have therefore opted
for a shared memory programming model to solve our problem.

3  Emulating  a  read-only  shared  memory  for  ray  tracing 

Why  such  a  model  for  distributed  memory  parallel  computers  ? 

Because of the difficulty of the message passing programming model, several studies have been
carried out to define mechanisms that implement a shared data model in distributed systems
[3, 4, 20]. The goal of these works is to provide a better abstraction of data mapping over a
set of distributed memories. In order to offer a general tool without degrading performance,
[20] and [3] study strategies to maintain data consistency between copies of modified
variables. Our study proceeds in the reverse order: trying to optimize a specific parallel
application (ray tracing), we came up with the emulation of a shared memory. The data
management resulting from this abstraction is quite attractive, but our primary objective
remains the efficiency of data accesses.

The aim of our study is to show that emulating a shared memory on a DMPC is the best way to
parallelize algorithms such as ray tracing, which use large read-only databases with no obvious
domain decomposition. With a DMPC, a portion of each node's memory can be used to store a part
of the shared database and the remaining portion as a cache to speed up slow global accesses.
The notion of cache, managed by software in our case, is the core of an efficient shared memory
emulation.

Caches were introduced to bridge the gap between fast processor cycle times and the slow access
times of large memories. Generally speaking, a memory cache is any hardware or software device
that stores, in a relatively small but fast-access area, a selected part of a database held in
a larger but slower memory. A general presentation of cache memories can be found in [26]. For
example, in the concept of virtual paged memory, the primary memory can be viewed as a cache
for the secondary memory. The use of a cache improves the bandwidth between the processor and
its memory; in our case it increases the bandwidth between a node and a virtual global memory.
Several devices use the concept of memory cache, as shown in figure 4.

Figure 5: A user node memory description.

Why  such  a  model  may  be  efficient  for  ray  tracing  ? 

Various  characteristics  of  the  ray  tracing  algorithm  led  us  to  design  a  software  tool  to  emulate 
a  global  memory  access  in  the  context  of  distributed  memories.  These  characteristics  are  as 
follows: 

• the huge amount of memory required by this algorithm makes database load balancing as
important as computation load balancing. Problems of increasing size are a challenge for
DMPCs;

• due to the coherence and topological properties of 3D objects, only a small part of the whole
database is required at a given time. Thus a caching mechanism can be efficient for our
problem;

• due to illumination effects (shading, reflection, refraction), the small part of the database
needed to evaluate one pixel is nearly impossible to determine statically. Thus the database
memory management must be dynamic;

• the computation of an image uses the database in a read-only way, so there is no data
coherency problem to manage.

How  we  distribute  the  shared  memory  ? 

In the ray tracing algorithm, the shared memory consists of the database and the bitmap (pixel
map). The sharing of pixels will be discussed in the next section. The database contains the
photometric and geometric parameters of the objects constituting the scene and, last but not
least, the object access structure. The mechanism used to manage the global memory is called
Object Paging. This designation covers two aspects: first, the virtual memory emulation is done
only for data memory; second, the indivisible item of storage is an object.

We have seen that in our implementation, objects are polygons and their access structure is a
regular grid. From now on, we call object an item of a page of the global database which can
be transferred between local memories (a polygon, a voxel of the grid, etc.). An object belongs
to one and only one page. One constraint is that the size of any object must be smaller than
the size of a page, and if the size of a page is not a multiple of the size of an object, the
difference represents lost memory cells. However, the main advantage is that the memory
location of an object is contiguous.

Figure 6: Accessing a global object.

In our algorithm, the whole database is first distributed equally over the set of nodes without
any particular mapping. Each PE's memory therefore contains almost the same number of pages.
The local memory of a PE is organized as shown in figure 5. Each local memory is divided into
three parts: one containing the code of the application, another storing a portion of the
database, and the remaining free space, which is used as a cache memory to optimize accesses to
the distributed global memory. The last two parts (database memory) are divided into pages to
allow memory management.

How  we  access  an  object  of  the  shared  memory  ? 

During the synthesis task, the application (ray tracing) can potentially access the whole
database through a software memory manager. On each node, when a cache miss is detected, i.e.
the page is neither in the local database nor in the cache memory, a request is sent to the
node responsible for this page. When the node receives the page, it stores it in the cache
memory according to an LRU (Least Recently Used) policy. The search for the page to replace is
done during the communication of the new page, and thus causes no extra cost (cf. Fig. 6).
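
A software page cache with LRU replacement can be sketched in Python as below. The class `PageCache` and the callback `fetch_remote`, which stands for the request sent to the node owning the page, are hypothetical names introduced for this sketch. The lookup in the node's own portion of the database and the overlap of the victim search with the communication are left out.

```python
from collections import OrderedDict


class PageCache:
    """Fixed-capacity cache of remote pages with Least Recently Used replacement."""

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity          # number of page frames reserved for the cache
        self.fetch_remote = fetch_remote  # callable: page number -> page contents
        self.pages = OrderedDict()        # ordered from least to most recently used
        self.misses = 0

    def get(self, page_number):
        if page_number in self.pages:
            self.pages.move_to_end(page_number)   # hit: mark as most recently used
            return self.pages[page_number]
        self.misses += 1                          # miss: ask the owning node
        data = self.fetch_remote(page_number)
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)        # evict the least recently used page
        self.pages[page_number] = data
        return data
```

Counting misses, as done here, gives a direct measure of the number of external requests a node issues while it renders its part of the image.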

In our implementation, an object is characterized by two numbers: the first one ($id_1$) is the
identifier of the class to which the object belongs, and the second one ($id_2$) is the
identifier of the member inside this class. The pair ($id_1$, $id_2$) represents one unique
location in the global memory. A class is a set of objects of the same type. All the objects of
one class are stored in contiguous pages to make the global memory management easier. The
information associated with one class is:

• $first_{page}(id)$, the first page where the objects of class $id$ are located.

• $size_{object}(id)$, the size of an object of class $id$.

• $nb_{object\_per\_page}(id)$, the number of objects of class $id$ in one page.

Thus, in order to access a global object, we must determine the following, where $/$ denotes
integer division and $\%$ the remainder:

• in which page the object is located:

$$num_{page} = first_{page}(id_1) + id_2 \,/\, nb_{object\_per\_page}(id_1)$$

• on which node this page is located:

$$num_{node} = num_{page} \,\%\, nb_{node}$$

• where this page is located with respect to this node:

$$dep_{node} = num_{page} \,/\, nb_{node}$$

• where the object is located with respect to this page:

$$dep_{page} = id_2 \,\%\, nb_{object\_per\_page}(id_1)$$

Once the address of the page ($@_{page}$) has been determined, after getting this page from
another node if necessary, the address of the object is evaluated as follows:

$$@_{object} = @_{page} + dep_{page} \times size_{object}(id_1)$$

For better performance, we have chosen these values to be powers of two. Thus, all the
operations needed to compute the global address of an object reduce to logical operations
(shifts and masks).
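
With power-of-two values, the address computation can be sketched in Python using only shifts and masks. The function names and the convention of passing the base-2 logarithms of $nb_{object\_per\_page}$, $nb_{node}$ and $size_{object}$ are assumptions of this sketch.

```python
def locate(id1, id2, first_page, log2_objects_per_page, log2_nb_node):
    """Return (num_page, num_node, dep_node, dep_page) for the object (id1, id2).

    first_page and log2_objects_per_page are per-class tables indexed by id1.
    """
    shift = log2_objects_per_page[id1]
    num_page = first_page[id1] + (id2 >> shift)        # id2 / nb_object_per_page
    num_node = num_page & ((1 << log2_nb_node) - 1)    # num_page % nb_node
    dep_node = num_page >> log2_nb_node                # num_page / nb_node
    dep_page = id2 & ((1 << shift) - 1)                # id2 % nb_object_per_page
    return num_page, num_node, dep_node, dep_page


def object_address(page_address, dep_page, log2_size_object):
    """Address of the object inside a page already present in local memory."""
    return page_address + (dep_page << log2_size_object)   # dep_page * size_object
```

For example, with 8 objects per page, 4 nodes and a class starting at page 10, object number 21 falls in page 12, which is the fourth page held by node 0, at position 5 within that page.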

Work  distribution  and  bitmap  distribution 

Once a shared database has been emulated, distributing the work so as to balance the load is
quite simple. Each PE owns a part of the bitmap. For example, if we use 32 PEs to compute an
image with a 512×512 resolution, each PE manages a 32×32 sub-bitmap. We use square (or nearly
square) sub-bitmaps in order to exploit the ray coherence property as much as possible. If the
PEs could directly address the frame buffer, centralized control would not be necessary. As
this facility is not available on the iPSC/2, a copy of the bitmap is managed by the host
computer of the hypercube in order to access the frame buffer. The synthesis of each sub-bitmap
requires global data accesses at the beginning of the task; the number of external requests
then progressively decreases as the memory cache retains the relevant items of the global
database.

When a PE completes the computation of its sub-bitmap, it sends a request to get an item of
work (i.e. a set of pixels) from a PE still working on its own sub-bitmap. This request travels
along a ring topology. If the request comes back unsatisfied, the PE knows that the image is
complete. This local termination detection is sufficient for our application.
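
The ring request can be illustrated by the following sequential Python simulation. It is only a sketch of the protocol described above: `request_work` is a hypothetical name, each PE's remaining work is modelled as a plain list, and message passing is replaced by a loop over the ring.

```python
def request_work(queues, requester):
    """Simulate one work request travelling around the ring.

    queues[p] is the list of work items PE p still has to compute.
    Returns an item taken from the first PE along the ring that still has work,
    or None if the request comes back to the requester unsatisfied.
    """
    n = len(queues)
    for hop in range(1, n):
        pe = (requester + hop) % n        # next neighbour on the ring
        if queues[pe]:
            return queues[pe].pop()       # this PE gives away one item of work
    return None                           # full turn without success: image is complete
```

A `None` result plays the role of the local termination detection: the idle PE has learnt that no other PE has work left to give away.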

To ensure good work balancing, the only parameter to be determined is the size of this item of
work. If its size is minimal (i.e. item of work = one pixel), we obtain the best possible work
balancing, assuming that the computation of one pixel cannot be divided over the set of nodes,
but the communication cost is then higher. Therefore, to benefit from good work balancing, we
must not generate more work in communication than in computation. Experimental results (see
Fig. 7) show that a size of about 3×3 pixels offers a good compromise.
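
Cutting a sub-bitmap into items of work can be sketched as below. The function `work_items` and its tuple layout are hypothetical; the default item size of 3 simply reflects the compromise reported above.

```python
def work_items(x0, y0, width, height, k=3):
    """Split the sub-bitmap with origin (x0, y0) into items of at most k x k pixels.

    Each item is returned as (x, y, w, h); items on the right and bottom edges
    are smaller when width or height is not a multiple of k.
    """
    items = []
    for y in range(y0, y0 + height, k):
        for x in range(x0, x0 + width, k):
            w = min(k, x0 + width - x)
            h = min(k, y0 + height - y)
            items.append((x, y, w, h))
    return items
```

A smaller `k` produces more items, hence finer balancing but more requests on the ring; a larger `k` has the opposite effect.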


Figure 7: Relative efficiency using different sizes for an item of work (Teapot, Coupe and
Rings4 databases).


Results 

Our parallel ray tracer has been tested on a set of scenes called the Standard Procedural
Databases (SPD), provided by Eric Haines (see [13]), and on other scenes (including the famous
Teapot from the University of Utah) described in Eric Haines' Neutral File Format (NFF).
Several synthesis times are given in the table of figure 10.

The first results of our algorithm on the iPSC/2 are promising. Comparing the results obtained
by this method (cf. Fig. 8 and 9) with our previous work [22, 23] highlights the improvements
brought by the shared database programming model. The behaviour of this algorithm is what a
user of parallel machines expects: using more PEs allows problems to be solved faster and
larger problems to be considered. This is due to the characteristics of the software global
memory management:

•  for  a  sufficient  size  of  memory  cache,  the  PEs  can  work  rapidly  since  the  number  of  requests 
to  others  is  small  ; 

•  the  size  of  the  memory  cache  is  flexible.  Indeed,  with  a  memory  fixed-sized  problem,  i.e. 
a  fixed-size  database,  using  more  PEs  increases  the  computation  power  of  course,  but  also 
provides  a  better  memory  management  as  local  cache  memory  increases  (see  Fig.  0). 

One of our goals is to render databases that are as large as possible. At present, we have
rendered the tetra10 database, which contains more than one million (1 048 576) polygons. This
scene, with its object access structure, requires 109 452 pages (of 1 280 bytes each), which
represents a shared memory of about 140 MBytes. The synthesis time with 64 nodes is 8 min
46 s. Note that this database cannot be rendered with 32 nodes (with 4 MBytes of memory per
node), whose combined memory is only 128 MBytes.

4  Conclusion 

The aim of our study of the parallelisation of the ray tracing method is to bring out a
parallel programming model well suited to this kind of algorithm. Because the performance of
the various parallel ray tracing algorithms is difficult to assess, we have carried out, and
continue to carry out, experiments on an iPSC/2 hypercube. Comparing the behaviour of our first
algorithm (see [22, 23]), which uses a message passing programming model, with the behaviour of
the latest one, described in this paper, which uses a shared database programming model, we
advocate the shared model approach when using large read-only databases with no obvious
distribution.

Figure 8: Speedup for the Rings images.

Figure 9: Efficiency for the Rings images.

Figure 10: Examples of synthesis times with 64 nodes.

Abstract .

We present the two most classical domain decomposition methods with non-overlapping subdomains:
a conforming one, the Schur complement method, and a non-conforming one, based on introducing a
Lagrange multiplier to enforce the continuity requirement at the interface between the
subdomains.

We show that these methods represent two dual formulations of a condensed problem on the
interface.

The problem of the parallel implementation of these methods is addressed, and the results of
some numerical experiments for ill-conditioned three-dimensional structural analysis problems
are given.

1.  Introduction  . 

The simplest parallel algorithm for solving elliptic partial differential equations consists in
solving the complete problem by the conjugate gradient method with parallelisation of the
matrix-vector product. This can be done by performing in parallel the computation for the rows
or the columns of the matrix associated with different substructures. However, with sparse
matrices arising from finite element methods, the amount of computation in a matrix-vector
product depends linearly on the number of variables. As the data transfers also depend linearly
on the number of variables, the parallelisation of these products typically leads to fine-grain
parallelism.

Moreover,  for  large  structural  analysis  problems,  using  a  global  conjugate  gradient  method  is  really  problematic, 
because  of  the  ill  conditioning  and  the  large  numbers  of  degrees  of  freedom  . 

It  is  possible  to  decrease  the  dimension  of  the  problem  and  to  get  parallel  algorithms  with  large  granularity 
by  using  domain  decomposition  methods  . 

Some of these methods involve overlapping subregions and are derived from the Schwarz
alternating principle, see [1] and [2]. It has been proved in [3] that the Schwarz alternating
procedure consists in solving the condensed problem associated with the Schur complement
operator by means of a block Gauss-Seidel algorithm.

Other methods involve non-overlapping subregions. These methods consist in solving a condensed
problem on the interface between the subdomains. The condensed operator is defined with the
help of the inverses of local matrices associated with independent local problems.

These methods appear to be better suited to finite element or finite difference methods: first,
because they lead to solving the condensed problem on the interface by the preconditioned
conjugate gradient method; secondly, because it is generally more difficult to split an
unstructured mesh into overlapping than into non-overlapping subregions; and lastly, because
the Schwarz alternating method is intrinsically less parallel.

In this paper, we first present the most standard domain decomposition method with
non-overlapping subregions: the Schur complement method.

Secondly, we present a non-conforming method, based on introducing a Lagrange multiplier to
enforce the continuity requirement at the interface between the subdomains, which we call the
hybrid method because it is similar to the well-known hybrid finite element method for the
elasticity equations.

We show that the two methods are associated with dual formulations of the condensed problem on
the interface.

Then, we address some practical problems in implementing these methods, concerning the topology
of the interface and the choice of the local solver.

Finally, we give some results obtained with the implementation of the hybrid method for solving
an ill-conditioned three-dimensional structural analysis problem.

In the remainder of this paper we use the vocabulary of linear elasticity, but the methods
presented here can, of course, be applied to any second-order elliptic partial differential
equation.

2.  The  Schur  complement  method  . 


The most classical domain decomposition method with non-overlapping subdomains, the so-called
Schur complement method, is based on the Gaussian elimination of the degrees of freedom inside
the substructures. Consider the linear elasticity equations on a domain $\Omega$, and let $K$
be the matrix associated with a Lagrangian finite element approximation of the displacement
fields. Split the domain $\Omega$ into two open subsets $\Omega_1$ and $\Omega_2$, with
$\Gamma_3$ the inner intersection of the boundaries $\Gamma_1$ and $\Gamma_2$ of $\Omega_1$ and
$\Omega_2$.


Figure  1  :  non-overlapping  domain  decomposition  . 

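The stiffness matrix associated with the renumbering of the degrees of freedom according to the
splitting of the domain into these three subsets can be written in the block form below:

$$K = \begin{pmatrix} K_{11} & 0 & K_{13} \\ 0 & K_{22} & K_{23} \\ K_{13}^T & K_{23}^T & K_{33} \end{pmatrix}$$

The stiffness matrices associated with the linear elasticity equations on $\Omega_1$ and
$\Omega_2$ with Neumann boundary condition on $\Gamma_3$ are:

$$K^{(1)} = \begin{pmatrix} K_{11} & K_{13} \\ K_{13}^T & K_{33}^{(1)} \end{pmatrix} \qquad K^{(2)} = \begin{pmatrix} K_{22} & K_{23} \\ K_{23}^T & K_{33}^{(2)} \end{pmatrix}$$

The coefficients of the matrices $K_{33}^{(1)}$ and $K_{33}^{(2)}$ are the contributions from
the integrals over $\Omega_1$ and $\Omega_2$ of the basis functions associated with the nodes
of $\Gamma_3$, and so $K_{33} = K_{33}^{(1)} + K_{33}^{(2)}$.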


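Writing the global system as $K x = b$, with $x = (x_1, x_2, x_3)$ and $b = (b_1, b_2, b_3)$
partitioned like $K$, one can perform a Gaussian elimination of the degrees of freedom inside
the open subsets $\Omega_1$ and $\Omega_2$ and get the following condensed problem, involving
only the degrees of freedom on $\Gamma_3$:

$$\left[ K_{33} - K_{13}^T K_{11}^{-1} K_{13} - K_{23}^T K_{22}^{-1} K_{23} \right] x_3 = b_3 - K_{13}^T K_{11}^{-1} b_1 - K_{23}^T K_{22}^{-1} b_2 \qquad (1)$$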


The associated matrix, the Schur complement matrix $S$, is symmetric and positive definite (see
[4]). This is proved by the following equality, which amounts to a change of basis for the dot
product associated with the matrix $K$:

$$\begin{pmatrix} I & 0 & 0 \\ 0 & I & 0 \\ -K_{13}^T K_{11}^{-1} & -K_{23}^T K_{22}^{-1} & I \end{pmatrix}
\begin{pmatrix} K_{11} & 0 & K_{13} \\ 0 & K_{22} & K_{23} \\ K_{13}^T & K_{23}^T & K_{33} \end{pmatrix}
\begin{pmatrix} I & 0 & -K_{11}^{-1} K_{13} \\ 0 & I & -K_{22}^{-1} K_{23} \\ 0 & 0 & I \end{pmatrix}
= \begin{pmatrix} K_{11} & 0 & 0 \\ 0 & K_{22} & 0 \\ 0 & 0 & S \end{pmatrix}$$

So, the problem (1) can be solved by the conjugate gradient method without actually computing
the coefficients of the matrix $S$.

Let $x_3$ be a given displacement vector on $\Gamma_3$. The product $S x_3$ is given by:

$$S x_3 = S^{(1)} x_3 + S^{(2)} x_3$$

with $S^{(1)} = K_{33}^{(1)} - K_{13}^T K_{11}^{-1} K_{13}$ and
$S^{(2)} = K_{33}^{(2)} - K_{23}^T K_{22}^{-1} K_{23}$.
Computing the product $S^{(1)} x_3$ involves two steps.

First step: the solution of the Dirichlet problem on $\Omega_1$ with the boundary conditions
given by $x_3$ on $\Gamma_3$:

$$K_{11} x_1 = - K_{13} x_3$$

Second step: the computation of the product of the matrix $K^{(1)}$ by the vector
$(x_1, x_3)^T$, which is easily shown to be equal to $(0, S^{(1)} x_3)^T$.

So, solving equation (1) by the conjugate gradient method gives a parallel algorithm with very
good granularity, because the main part of the work consists in computing independent local
contributions to the product by the Schur complement matrix, which mainly involves solving
independent local problems.
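
The two-step computation of a local contribution $S^{(1)} x_3$ can be checked on a small example. The Python sketch below is only an illustration on tiny dense blocks: the helper names are hypothetical, exact rational arithmetic is used so that the vanishing of the first block can be tested exactly, and a real implementation would rely on a sparse factorization of $K_{11}$ instead of the naive elimination shown here.

```python
from fractions import Fraction


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # first usable pivot
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    x = [Fraction(0)] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def local_schur_product(K11, K13, K33_1, x3):
    """Return the two blocks of K^(1) (x1, x3)^T, i.e. (0, S^(1) x3)."""
    # Step 1: Dirichlet problem K11 x1 = -K13 x3.
    x1 = solve(K11, [-v for v in matvec(K13, x3)])
    # Step 2: product by K^(1); the interior block must vanish.
    K31 = [list(col) for col in zip(*K13)]                  # K13 transposed
    top = [a + b for a, b in zip(matvec(K11, x1), matvec(K13, x3))]
    bottom = [a + b for a, b in zip(matvec(K31, x1), matvec(K33_1, x3))]
    return top, bottom


# Two interior unknowns and one interface unknown.
top, bottom = local_schur_product([[2, -1], [-1, 2]], [[0], [-1]], [[1]], [1])
```

In this example the explicit formula gives $S^{(1)} = 1 - 2/3 = 1/3$, which is the value found in `bottom`, while `top` is zero as expected. Each subdomain can evaluate its own contribution independently, and only the interface vectors need to be summed.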

3.  A  preconditioner  for  the  Schur  complement  method  . 


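In the previous section, we have seen that the product by the local Schur complement matrix can
be computed by solving a problem with Dirichlet boundary conditions on the interface and then
computing the trace of the corresponding internal forces.

This leads to the following equation:

$$\begin{pmatrix} 0 \\ S^{(1)} x_3 \end{pmatrix} =
\begin{pmatrix} K_{11} & K_{13} \\ K_{13}^T & K_{33}^{(1)} \end{pmatrix}
\begin{pmatrix} K_{11} & K_{13} \\ 0 & I \end{pmatrix}^{-1}
\begin{pmatrix} 0 \\ x_3 \end{pmatrix} \qquad (2)$$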

The Schur complement matrix is the discrete operator associated with the mapping of the trace
of the displacement field on the interface onto the trace of the internal forces. This mapping
is a so-called Steklov-Poincaré operator (see [5]).

So, the inverse of the local Schur complement matrix can be computed by mapping the trace of
the internal force field onto the trace of the displacement field on the interface. This leads
to solving a local problem with Neumann boundary conditions.

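Let us note:

$$\begin{pmatrix} x_1 \\ x_3 \end{pmatrix} =
\begin{pmatrix} K_{11} & K_{13} \\ 0 & I \end{pmatrix}^{-1}
\begin{pmatrix} 0 \\ x_3 \end{pmatrix}$$

Equation (2) shows that $x_3$ is the restriction to $\Gamma_3$ of the solution of the Neumann
problem (7):

$$\begin{pmatrix} K_{11} & K_{13} \\ K_{13}^T & K_{33}^{(1)} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 0 \\ S^{(1)} x_3 \end{pmatrix}$$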

An efficient preconditioner for the condensed problem associated with the Schur complement
matrix can be built in the following form:

$$M = D_1 \, [S^{(1)}]^{-1} D_1^T + D_2 \, [S^{(2)}]^{-1} D_2^T ,$$

where the $D_i$ matrices are weighting matrices such that $D_1 + D_2 = I_{|\Gamma_3}$ (see, for
instance, [6] and [7]).

As the Schur complement matrix maps the trace of the displacement field on the interface onto
the trace of the internal forces, the residual of the conjugate gradient algorithm is
homogeneous to a force field, whereas the unknown of the problem is a displacement field. The
preconditioner presented here consists in mapping the gradient vector back into the primal
space associated with the displacements.

Computing this preconditioner leads to the same degree of parallelism as the plain algorithm,
because it mainly consists in solving independent local Neumann problems and then assembling
the local contributions on the interface.

4. The hybrid finite element method .


Another  domain  decomposition  method  is  based  on  a  mechanical  approach  and  involves  the  introduction  of 
a  Lagrange  multiplier  on  the  interfaces  to  remove  the  continuity  constraint . 


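The equations of linear elasticity with homogeneous boundary conditions are:

$$A u = f \ \text{in } \Omega, \qquad u = 0 \ \text{on } \Gamma_0, \qquad \text{with } (Au)_i = -\frac{\partial}{\partial x_j} \left( a_{ijkl} \, \varepsilon_{kl}(u) \right) \qquad (3)$$

The usual variational form of this problem consists in finding $u$ in $(H^1(\Omega))^3$
satisfying the boundary condition $u = 0$ on $\Gamma_0$ which minimizes the energy functional:

$$I(v) = \frac{1}{2} a(v, v) - (f, v) \quad \text{with} \quad a(u, v) = \int_\Omega a_{ijkl} \, \varepsilon_{kl}(u) \, \varepsilon_{ij}(v) \, dx .$$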


Let us consider the same splitting of the domain as in the previous section. For the sake of simplicity, let us assume that the boundary of the interface $\Gamma_3$ is embedded in $\Gamma_0$, the part of the boundary of $\Omega$ with homogeneous Dirichlet conditions. Then the traces on $\Gamma_3$ of the displacement fields $u$ satisfying the boundary condition $u = 0$ on $\Gamma_0$ belong to the space $(H_{00}^{1/2}(\Gamma_3))^3$.

Solving the linear elasticity equations consists in finding two functions $u_1$ and $u_2$, in the functional spaces $V_1$ and $V_2$ of the fields belonging to $(H^1(\Omega_1))^3$ and $(H^1(\Omega_2))^3$ that satisfy the boundary conditions on $\Gamma_1$ and $\Gamma_2$, which minimize the sum of the energies $I(v) = I_1(v_1) + I_2(v_2)$ under the continuity constraint $v_1 = v_2$ on $\Gamma_3$.

The dual form of the continuity condition is:

$$(v_1 - v_2, \mu)_{\Gamma_3} = 0 \quad \text{for each } \mu \text{ in } [(H_{00}^{1/2}(\Gamma_3))^3]' .$$


The primal hybrid variational principle is based on removing the inter-subdomain continuity constraint by introducing a Lagrange multiplier (see for instance [8] or [9]). Under the assumption that the so-called Ladyzenskaia-Babuska-Brezzi condition is satisfied:

$$\sup_{\|(v_1, v_2)\|_{V_1 \times V_2} = 1} (v_1 - v_2, \mu)_{\Gamma_3} \ \ge\ C\, \|\mu\|_{[(H_{00}^{1/2}(\Gamma_3))^3]'} ,$$

one can show (see [10]) that the constrained minimization problem above is equivalent to finding the saddle point of the Lagrangian:

$$L(v, \mu) = I_1(v_1) + I_2(v_2) + (v_1 - v_2, \mu)_{\Gamma_3} . \qquad (4)$$

This means finding the fields $(u_1, u_2)$ in $V_1 \times V_2$ and the Lagrange multiplier $\lambda$ in $[(H_{00}^{1/2}(\Gamma_3))^3]'$ which satisfy:

$$L(u, \mu) \le L(u, \lambda) \le L(v, \lambda) ,$$

for each field $v = (v_1, v_2)$ in $V_1 \times V_2$ and each $\mu$ in $[(H_{00}^{1/2}(\Gamma_3))^3]'$.

Clearly, the left inequality imposes $(u_1 - u_2, \mu)_{\Gamma_3} \le (u_1 - u_2, \lambda)_{\Gamma_3}$ for each $\mu$. Since $\mu$ ranges over a whole vector space, this requires $(u_1 - u_2, \mu)_{\Gamma_3} = 0$ for each $\mu$ in $[(H_{00}^{1/2}(\Gamma_3))^3]'$; thus the continuity constraint is satisfied by the solution of the saddle-point problem.

The right inequality implies $I_1(u_1) + I_2(u_2) \le I_1(v_1) + I_2(v_2)$ for each $(v_1, v_2)$ in $(H^1(\Omega))^3$, which means that $u_1$ and $u_2$ minimize the sum of the energies on $\Omega_1$ and $\Omega_2$ among the fields satisfying the continuity requirement, and so $u_1$ and $u_2$ are the restrictions to $\Omega_1$ and $\Omega_2$ of the solution of the primal problem (3).


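The classical variational interpretation of the saddle-point problem (4) leads to the equations:

$$\begin{cases} A_1 u_1 + B_1^t \lambda = f_1 & \text{in } \Omega_1 \\ u_1 = 0 & \text{on } \Gamma_0 \cap \Gamma_1 \\ A_2 u_2 - B_2^t \lambda = f_2 & \text{in } \Omega_2 \\ u_2 = 0 & \text{on } \Gamma_0 \cap \Gamma_2 \\ B_1 u_1 - B_2 u_2 = 0 & \text{on } \Gamma_3 \end{cases} \qquad (5)$$

$A_1$ and $A_2$ are the differential operators of the linear elasticity equations on $\Omega_1$ and $\Omega_2$ with Neumann boundary conditions on $\Gamma_3$, and $B_1$ and $B_2$ are the trace operators over $\Gamma_3$ of functions belonging to $V_1$ and $V_2$.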

The analysis of these equations shows that the Lagrange multiplier $\lambda$ is in fact equal to the interaction forces between the substructures along their common boundary. Clearly, to get independent local displacement problems, it is necessary to introduce the forces on the interfaces. From a structural analysis point of view this is hardly a surprise. Nevertheless, the precise functional analysis above is useful because it allows classical results about finite element approximations for hybrid or mixed differential equations to be used.

5. Discretization of the hybrid formulation.


A finite element discretization of the hybrid formulation (5) leads to the following set of linear equations, in which the notation for the variables of the discrete problem is the same as that formerly used for the continuous formulation:

$$\begin{cases} K^{(1)} u_1 + B_1^t \lambda = f_1 \\ K^{(2)} u_2 - B_2^t \lambda = f_2 \\ B_1 u_1 - B_2 u_2 = 0 \end{cases} \qquad (6)$$


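By elimination of the displacements in equations (6), the problem can be written with respect to $\lambda$ only:

$$\big[\, B_1 [K^{(1)}]^{-1} B_1^t + B_2 [K^{(2)}]^{-1} B_2^t \,\big]\, \lambda = B_1 [K^{(1)}]^{-1} f_1 - B_2 [K^{(2)}]^{-1} f_2 .$$

So $\lambda$ satisfies the following equation:

$$D \lambda = b , \qquad (7)$$

with

$$D = \begin{bmatrix} B_1 & -B_2 \end{bmatrix} \begin{bmatrix} [K^{(1)}]^{-1} & 0 \\ 0 & [K^{(2)}]^{-1} \end{bmatrix} \begin{bmatrix} B_1^t \\ -B_2^t \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} B_1 & -B_2 \end{bmatrix} \begin{bmatrix} [K^{(1)}]^{-1} & 0 \\ 0 & [K^{(2)}]^{-1} \end{bmatrix} \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} .$$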


Obviously, the matrix $D$ is symmetric and positive. It is definite if the interpolation spaces chosen for $u_1$, $u_2$ and $\lambda$ satisfy the discrete Ladyzenskaia-Babuska-Brezzi condition.

To be able to use the standard approximation results for mixed or hybrid formulations, it is necessary to find finite element spaces such that the discrete Ladyzenskaia-Babuska-Brezzi condition is satisfied uniformly with respect to $h$, the mesh size parameter.

But, generally, checking the uniform Ladyzenskaia-Babuska-Brezzi condition for the discrete problem may be tough (see [11] and [12]). The finite elements used for the Lagrange multiplier must be associated with polynomials of one degree less than the ones used for the primal unknowns, as the Lagrange multiplier is homogeneous to some partial derivatives of the solution of the primal problem. When the hybrid formulation is used as described above to get a domain decomposition method, the Lagrange multiplier is introduced just to enforce the continuity condition. The values of the discrete multiplier do not need to be a good approximation of the continuous interaction forces between the substructures.

So, satisfying the uniform Ladyzenskaia-Babuska-Brezzi condition is not necessary in this case. The discrete form of the continuity constraint $v_1 = v_2$ on $\Gamma_3$ can be written simply: $v_1 = v_2$ for each degree of freedom located on $\Gamma_3$.

With such a condition, the discrete $B_i$ matrices are just boolean restriction matrices.

Taking this approximation is equivalent to using finite elements associated with the same polynomials for the Lagrange multiplier $\lambda$ and for the displacement fields, and to taking a collocation approximation of the integral:

$$\int_{\Gamma_3} (v_1 - v_2)\, \mu \, d\Gamma .$$

The uniform Ladyzenskaia-Babuska-Brezzi condition for the discrete problem is not satisfied. But the displacement fields $u_1$ and $u_2$, solution of the discrete hybrid problem, are conforming, due to the condition $u_1 = u_2$ for each degree of freedom located on $\Gamma_3$. So these fields are, in fact, the restrictions to the two subdomains of the solution of the discrete primal global problem, for which the standard approximation results apply.

As a consequence of this form of discretization, one can see that the method can be applied even though the boundary of the interface $\Gamma_3$ is not embedded in $\Gamma_0$, the part of the boundary of $\Omega$ with homogeneous Dirichlet conditions.
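The claim that the discrete hybrid solution coincides with the conforming global solution can be checked on a toy problem. The sketch below is an assumption made for illustration, not the elasticity problem of the paper: it uses a one-dimensional bar clamped at both ends, split into two subdomains that share a single interface node, with boolean $B_i$ matrices; the helper `bar_stiffness` and all sizes are hypothetical.

```python
import numpy as np

def bar_stiffness(n, free_end):
    """Tridiagonal stiffness of a 1-D bar with n unknowns, clamped at one end.
    `free_end` ('first' or 'last') is the interface end, which carries only
    this subdomain's half of the assembled diagonal coefficient."""
    K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    i = 0 if free_end == "first" else n - 1
    K[i, i] = 1.0
    return K

n1, n2 = 4, 5                                   # local unknowns, interface included
K1 = bar_stiffness(n1, "last")
K2 = bar_stiffness(n2, "first")
B1 = np.zeros((1, n1)); B1[0, -1] = 1.0         # boolean restriction to the interface
B2 = np.zeros((1, n2)); B2[0, 0] = 1.0
f1, f2 = np.ones(n1), np.ones(n2)

# Dual interface problem D lambda = b
D = B1 @ np.linalg.solve(K1, B1.T) + B2 @ np.linalg.solve(K2, B2.T)
b = B1 @ np.linalg.solve(K1, f1) - B2 @ np.linalg.solve(K2, f2)
lam = np.linalg.solve(D, b)

# Local displacements recovered from the first two equations of the system
u1 = np.linalg.solve(K1, f1 - B1.T @ lam)
u2 = np.linalg.solve(K2, f2 + B2.T @ lam)

# Conforming global problem on the assembled bar
n = n1 + n2 - 1
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.zeros(n); f[:n1] += f1; f[n1 - 1:] += f2
u = np.linalg.solve(K, f)
```

In this toy case `u1` and `u2` agree at the interface node and, glued together, reproduce `u`, while `lam` plays the role of the interaction force between the two halves.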


6. Solution of the discrete hybrid problem.


The problem (7) can be solved by the conjugate gradient method, for it is possible to compute the product of the matrix $D$ by a vector although the matrix itself is never computed.

Let $\mu$ be a vector. Computing the product $v = D \mu$ involves the following three steps.

Step one: computation of the matrix-vector product

$$\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} B_1^t \\ -B_2^t \end{bmatrix} \mu ,$$

which is just a reordering operation because the $B_i$ matrices are boolean.

Step two: computation of the product

$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} [K^{(1)}]^{-1} & 0 \\ 0 & [K^{(2)}]^{-1} \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} ,$$

which means computing the solution of two independent local sets of linear equations associated with the local linear elasticity problems with Neumann boundary conditions on the interface $\Gamma_3$.

Step three: computation of the variation on the interface of the displacement fields $w_1$ and $w_2$,

$$v = \begin{bmatrix} B_1 & -B_2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = B_1 w_1 - B_2 w_2 .$$

Obviously, the main step is the second one, and it can be performed in parallel, whereas only step three involves interprocessor data transfers. So this method leads to a parallel algorithm with the same kind of granularity as the Schur complement method.
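The three steps can be sketched as a matrix-free operator passed to a plain conjugate gradient loop. This is a minimal illustration under assumptions: the boolean $B_i$ matrices are represented by index arrays (`iface1`, `iface2`, hypothetical names), the local matrices are small random symmetric positive definite stand-ins, and `np.linalg.solve` replaces the factorized local solver that a real code would reuse at every iteration.

```python
import numpy as np

def make_dual_operator(K1, K2, iface1, iface2):
    """Return a function mu -> D mu; the matrix D itself is never formed."""
    def apply_D(mu):
        # Step one: scatter +mu and -mu to the interface unknowns (B_i^t).
        v1 = np.zeros(K1.shape[0]); v1[iface1] = mu
        v2 = np.zeros(K2.shape[0]); v2[iface2] = -mu
        # Step two: two independent local Neumann solves (parallel part).
        w1 = np.linalg.solve(K1, v1)
        w2 = np.linalg.solve(K2, v2)
        # Step three: variation of the displacements on the interface.
        return w1[iface1] - w2[iface2]
    return apply_D

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=200):
    """Unpreconditioned conjugate gradient for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
        Ap = apply_A(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def random_spd(n, rng):
    """Hypothetical stand-in for a local stiffness matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(0)
K1, K2 = random_spd(6, rng), random_spd(7, rng)
iface1, iface2 = np.array([4, 5]), np.array([0, 1])
apply_D = make_dual_operator(K1, K2, iface1, iface2)
b = rng.standard_normal(2)
lam = conjugate_gradient(apply_D, b)
```

Only the last line of `apply_D` combines data from both subdomains, which is where interprocessor transfers would occur.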


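The $B_i$ matrices obtained with the discrete hybrid method presented in the previous section are boolean matrices. So, the contribution $D^{(1)}$ of subdomain number 1 to the dual interface matrix is equal to:

$$D^{(1)} = B_1 [K^{(1)}]^{-1} B_1^t = \begin{bmatrix} 0 & I \end{bmatrix} \begin{bmatrix} K_{11} & K_{13} \\ K_{31} & K_{33}^{(1)} \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ I \end{bmatrix} .$$

Let us write:

$$\begin{bmatrix} C_1 \\ C_3^{(1)} \end{bmatrix} = \begin{bmatrix} K_{11} & K_{13} \\ K_{31} & K_{33}^{(1)} \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ I \end{bmatrix} .$$

Then $D^{(1)}$ is equal to the matrix $C_3^{(1)}$.

From the previous relation, one can see that the $C_1$ and $C_3^{(1)}$ matrices satisfy the following equations:

$$K_{11} C_1 + K_{13} C_3^{(1)} = 0 , \qquad K_{31} C_1 + K_{33}^{(1)} C_3^{(1)} = I .$$

Hence, by elimination of the $C_1$ matrix in the previous equations, we can derive the following equality:

$$- K_{31} K_{11}^{-1} K_{13} C_3^{(1)} + K_{33}^{(1)} C_3^{(1)} = \big[ K_{33}^{(1)} - K_{31} K_{11}^{-1} K_{13} \big]\, C_3^{(1)} = I .$$

So, the $D^{(1)}$ matrix is, in fact, the inverse of the Schur complement matrix $S^{(1)}$.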
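The identity $D^{(1)} = [S^{(1)}]^{-1}$ can be verified by hand on the smallest possible case. The $2 \times 2$ local matrix below, with one interior and one interface unknown, is a hypothetical example chosen only to make the arithmetic easy:

```latex
K^{(1)} = \begin{bmatrix} K_{11} & K_{13} \\ K_{31} & K_{33}^{(1)} \end{bmatrix}
        = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix},
\qquad
[K^{(1)}]^{-1} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},
\qquad
D^{(1)} = \begin{bmatrix} 0 & 1 \end{bmatrix} [K^{(1)}]^{-1} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 2 ,

S^{(1)} = K_{33}^{(1)} - K_{31} K_{11}^{-1} K_{13}
        = 1 - (-1)\cdot\tfrac{1}{2}\cdot(-1) = \tfrac{1}{2},
\qquad
[S^{(1)}]^{-1} = 2 = D^{(1)} .
```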

From a functional analysis viewpoint, the hybrid method is the dual of the Schur complement method, because the condensed problem of the hybrid method is posed in terms of the forces on the interface, whereas the Schur complement operator acts on the displacement field on the interface.

From the linear algebra viewpoint, with the discretization presented here, the duality of the two methods is simply expressed by the following relation between the two interface operators:

$$S = S^{(1)} + S^{(2)} , \qquad D = [S^{(1)}]^{-1} + [S^{(2)}]^{-1} .$$

7. Topology of the interface for conforming and non-conforming domain decomposition methods.


There are some features of the interface topology which could make the domain decomposition method with Lagrange multiplier more suitable for a parallel implementation than the conforming Schur complement method.

When a point belongs to several subdomains, the coefficients of the condensed matrix for the degrees of freedom associated with this point are the sum of the contributions of the various subdomains to which the point belongs. As regards data dependency in an implementation on a multiprocessor system, with each processor performing the computation associated with one subdomain, this means that the result of the product by the Schur complement matrix for such nodes depends on more than two local contributions.

In a distributed memory context, it means that two subdomains are neighbours as soon as they have just one node in common. In the case of a chessboard decomposition of a two-dimensional domain, each subdomain has eight neighbours. For a real three-dimensional topology, the number of neighbours can be very large. For a decomposition into cubes, for instance, there would be as many neighbours as the sum of the numbers of faces, edges and vertices of a cube, that is 26.


In a shared memory context, there is no problem with data transfers, but the assembly of the result of the product by the Schur complement matrix is still complex, because the number of local contributions depends on the location of the point.


Let us now consider the domain decomposition method with Lagrange multiplier. The only sequential part of the computation of the product of the dual matrix $D$ by a vector lies in the third step, which consists in computing on each interface the variation of the displacement fields:

$$\begin{bmatrix} B_1 & -B_2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = B_1 w_1 - B_2 w_2 .$$

The $B_1$ and $B_2$ matrices are the discrete operators associated with the weak formulation of the continuity of the displacement fields on the interface:

$$(v_1 - v_2, \mu)_{\Gamma_3} = 0 \quad \text{for each } \mu \text{ in } [(H_{00}^{1/2}(\Gamma_3))^3]' .$$


If the interface $\Gamma_3$ between two subdomains has zero measure, this equation vanishes. In fact $B_1$ and $B_2$ are discrete trace operators, and the continuous trace operators are defined only on subsets of the boundary with nonzero measure. This remains true even when the same degrees of freedom are taken for the Lagrange multiplier as for the restrictions to the interface of the displacement fields, as was presented in the previous section.


Figure 2: neighbouring domains with the Schur complement method and the hybrid method.

For each degree of freedom located on a point belonging to more than two subdomains, there are then as many degrees of freedom for the Lagrange multiplier as pairs of subdomains whose interface has nonzero measure.


Figure 3: degrees of freedom of the Lagrange multiplier for intersecting edges.

In the case of a chessboard decomposition of a two-dimensional domain, each subdomain has only four neighbours, one for each edge. Each degree of freedom of the displacement fields located at a vertex is associated with four degrees of freedom for the Lagrange multiplier, one for each edge intersecting at the vertex. For a decomposition into cubes of a three-dimensional domain, there are as many neighbours as faces, i.e. only six.

For implementation on a parallel system with local memory, this means that the number of processors connected to each node of the system needs to be equal to four for a two-dimensional splitting, with the same topology as a finite difference grid, and equal to six for a three-dimensional decomposition, again with the topology of a three-dimensional regular grid. In both cases the number of neighbours is obviously minimal.

In both shared memory and distributed memory contexts, the assembly of the product of a vector by the dual interface matrix $D$ is simpler because all the points have the same status, and there are exactly two local contributions to the computation of the product for all the interface nodes.
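The neighbour counts quoted for the two methods can be reproduced by enumerating the cells adjacent to one cell of a regular chessboard-like decomposition. The function name and the offset-based representation are assumptions made for this sketch.

```python
from itertools import product

def neighbour_counts(dim):
    """For one interior cell of a regular decomposition in `dim` dimensions,
    return (cells sharing at least one point, cells sharing a whole face)."""
    offsets = [o for o in product((-1, 0, 1), repeat=dim) if any(o)]
    # Schur complement method: any shared node makes two subdomains neighbours.
    point_neighbours = len(offsets)
    # Hybrid method: only interfaces of nonzero measure (faces) count.
    face_neighbours = sum(1 for o in offsets if sum(abs(c) for c in o) == 1)
    return point_neighbours, face_neighbours

counts_2d = neighbour_counts(2)   # (8, 4)
counts_3d = neighbour_counts(3)   # (26, 6)
```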

8. Presentation of a structural analysis problem for a composite beam.

The test problem we consider consists in solving the linear elasticity equations for a composite beam made of a little more than one hundred stiff carbon fibers bound by an incompressible elastomer matrix.


Homogenization methods do not work for such a device, with its macroscopic-scale discontinuities and very different materials. For instance, the Young modulus in the direction of the axis of the beam is 53000 MPa for the fibers and 7.8 MPa for the elastomer. But, because of the composite structure, the finite element mesh for solving the problem with discontinuous coefficients must be very fine, for it must resolve the material discontinuities. This leads to a very large matrix, so the problem can be solved only by iterative methods such as the conjugate gradient method.


Figure 5: a composite "pencil".

However, substructuring is very easy in the present case, for the beam is made of similar jointed composite "pencils", each consisting of one fiber with its elastomer matrix.

Furthermore, it must be noted that the problem we tackle is very ill conditioned.

First, for geometrical reasons, when trying to solve the pure bending problem, i.e. the case of a fixed bottom of the beam and transverse stresses on the top. The condition number of the matrix of the problem increases with the ratio of the length of the beam to its width. This is exactly the case we are most interested in solving.

Secondly, for material reasons, because of the composite structure and because of the incompressibility of the elastomer. To enforce this constraint, we introduce a penalty parameter, and the condition number increases when this parameter tends to 0.

So, this problem gives a very good example of a stiff mechanical engineering problem with natural substructuring. Because of both the ill conditioning and the high dimension, only supercomputers make it possible to tackle it. For the same reasons, and because substructuring is straightforward, it is a very interesting problem for testing domain decomposition methods on parallel supercomputers.

Moreover, it is simple to build smaller test problems with the same features by solving the elasticity equations in domains made of only a few pencils, with the same global length-to-width ratio as for the complete beam.

9. Choice of the local solver.


The first tests of the hybrid finite element method were performed with solution of the independent local subproblems by the conjugate gradient algorithm.

As expected, the hybrid method gave good results from the point of view of parallelism. But comparison with the conforming global conjugate gradient method with a parallel matrix-vector product led to the conclusion that the non-conforming hybrid method was generally much more expensive, although the conforming conjugate gradient is less efficient for parallel processing.

The reason is clear: the length-to-width ratio is higher for one pencil than for a beam made of several pencils. Thus, substructuring leads to local problems with a condition number greater than that of the conforming primal problem. The choice of the conjugate gradient method for solving the local equations then leads to a more expensive algorithm.

Clearly, the solution consists in using a direct method for solving the local problems. This is possible because the number of degrees of freedom in the substructures can be much smaller than that of the complete domain. Furthermore, as each local set of equations needs to be solved several times, the time for the LU decomposition is not predominant, as it generally is for direct solution methods.

This problem with the condition number of the local matrices seems to be linked with special features of the particular problem we try to solve. But the crucial point with domain decomposition methods is the convergence speed of the outer iterative process, because each iteration requires the solution of all the local problems and, thus, is very expensive.

Thus, it is better to locate the interfaces in regions where the solutions are smooth. As a consequence, the ill conditioning due to geometry may well be worse in the subdomains.

Moreover, iterative methods are sometimes faster than direct methods because of the cost of the LU factorization of the matrices. But when there are many right-hand sides, as is the case with domain decomposition methods because of the outer iterative procedure, direct methods are generally more effective.
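The advantage of a direct local solver when there are many right-hand sides can be sketched as follows: the factorization is computed once, and each outer iteration then costs only a forward and a backward substitution. The class name, the dense storage and the random test matrix are assumptions made for illustration; the substitutions are written out explicitly rather than optimized.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a lower triangular matrix L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for an upper triangular matrix U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

class FactoredLocalSolver:
    """Hypothetical local solver: Cholesky factorization done once, reused for every solve."""
    def __init__(self, K):
        self.L = np.linalg.cholesky(K)          # K = L L^t, computed a single time

    def solve(self, rhs):
        y = forward_substitution(self.L, rhs)
        return backward_substitution(self.L.T, y)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
K = A @ A.T + 6.0 * np.eye(6)                   # stand-in for a local stiffness matrix
solver = FactoredLocalSolver(K)
right_hand_sides = [rng.standard_normal(6) for _ in range(4)]
solutions = [solver.solve(rhs) for rhs in right_hand_sides]
```

Each call to `solve` stands for one outer conjugate gradient iteration; the cost of the factorization is amortized over all of them.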

Furthermore, the domain splitting may be performed in such a way that the substructures have a slender shape, in order to get small bandwidths for the local matrices. So, the cost of computing the Choleski factorization of all the local matrices is much lower, in both CPU time and memory requirements, than for the matrix of the complete problem.

Finally, it is clear that the use of iterative methods for solving the local problems prevents optimal load balancing, because the number of operations cannot be predicted.

The hybrid method with solution of the local problems by the Choleski method has been implemented on CRAY-2 and CRAY-YMP832 machines for the three-dimensional problem presented in the previous section of the paper.

Tests have been performed with subdomains consisting of one or a few pencils, and with numbers of subdomains between four and sixteen.

With highly optimized backward and forward substitutions for the local problems (see [13]), speed-ups have been more than 3 on the CRAY-2 and more than 7 on the CRAY-YMP. These performances are nearly the best possible, given the memory contention problems on these machines. The global computation speeds have been 700 Mflops on the CRAY-2 and a little less than 2 Gflops on the CRAY-YMP.

These results demonstrate the ability of algorithms based on domain decomposition methods, with solution of the local problems by direct solvers, to reach maximum performance on vector and parallel supercomputers.

10. Some comparisons of the performances of the hybrid domain decomposition method and the global Choleski factorization.

The table below presents some results obtained with a direct global solver, the Choleski factorization, and with the hybrid domain decomposition method, for three different test problems.

In all the cases, we computed the results of the pure bending problem for three-dimensional beams consisting of four, nine, or sixteen composite pencils.

We indicate the number of subdomains, each subdomain being made of one pencil, the global number of degrees of freedom, the number of degrees of freedom on the interface, and the CPU times and memory requirements.

The Poisson ratio for the elastomer is 0.49.

The CPU times include the time for the Choleski factorization of all the local matrices for the hybrid domain decomposition method.

The stopping criterion for the outer conjugate gradient algorithm is:

$$\frac{|u_n - u|}{|u|} \le 10^{-8} \quad \text{and} \quad \frac{|K u_n - b|}{|b|} < 10^{[\ldots]} ,$$

where $u_n$ denotes the current iterate, $u$ is the result obtained with the global Choleski factorization, and $K$ is the global stiffness matrix.


number of subdomains                     4         9         16
global number of d.o.f.                  13000     28000     48000

hybrid domain decomposition method
  number of interface d.o.f.             2450      7350      [illegible]
  number of iterations                   130       210       300
  CPU time (s)                           20        73        193
  memory size (Mw)                       1.6       4.5       9.5

global Choleski factorization
  CPU time (s)                           15        150       650
  memory size (Mw)                       3.6       16        50

These tests show that the domain decomposition method may be, with a well suited splitting of the domain, more efficient than the direct solution in terms of both memory requirements and CPU time, even for very ill conditioned three-dimensional problems.

Furthermore, the domain decomposition method is much better suited to parallel processing, and it can be efficiently implemented on distributed memory machines, because the main part of the data lies in the LU decomposition of the matrices of the local problems, which can be located in local memories. The data transfers involve only the traces of the fields on the interface and so are several orders of magnitude smaller than the number of operations to be performed in parallel for solving the local sub-problems (see [14] for a parallel implementation on an Intel hypercube machine of a preconditioner based on the Schur complement method).
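The comparison can be made explicit by computing, for each test case, the ratio of the cost of the global Choleski factorization to that of the hybrid method. The figures are those of the table; the value 150 for the nine-subdomain Choleski CPU time is read from a damaged table entry and should be treated as an assumption, as are the dictionary layout and key names.

```python
# CPU times in seconds and memory sizes in megawords, keyed by number of subdomains.
table = {
    4:  {"dd_cpu": 20.0,  "dd_mem": 1.6, "chol_cpu": 15.0,  "chol_mem": 3.6},
    9:  {"dd_cpu": 73.0,  "dd_mem": 4.5, "chol_cpu": 150.0, "chol_mem": 16.0},  # 150: assumed reading
    16: {"dd_cpu": 193.0, "dd_mem": 9.5, "chol_cpu": 650.0, "chol_mem": 50.0},
}

# A ratio above 1 means the hybrid domain decomposition method is cheaper.
ratios = {
    n: (row["chol_cpu"] / row["dd_cpu"], row["chol_mem"] / row["dd_mem"])
    for n, row in table.items()
}

for n, (cpu_ratio, mem_ratio) in ratios.items():
    print(f"{n:2d} subdomains: CPU ratio {cpu_ratio:.2f}, memory ratio {mem_ratio:.2f}")
```

With these figures the direct solver is faster only for the smallest case, and the advantage of the hybrid method in both time and memory grows with the problem size.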

11. Conclusions.

The Schur complement and hybrid domain decomposition methods appear, in practice, to combine characteristics of direct and iterative solution methods.

They are iterative methods, because they consist in solving an interface problem by the preconditioned conjugate gradient method. Like other iterative methods, they require less memory than direct methods, because only the LU factorizations of small local matrices have to be stored.

But with domain decomposition methods, the dimension of the problem to be solved by an iterative method is much smaller than the dimension of the complete problem. The matrix of the condensed problem on the interface is much denser than the usually sparse matrix of the complete problem, and its condition number is lower, because the elimination of the variables associated with the internal nodes acts as a kind of block Jacobi preconditioner.
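The lower condition number of the condensed problem can be observed on a simple model. The sketch below is an assumption made for illustration, not the elasticity problem of the paper: it uses the 5-point Laplacian on a square grid split into two subdomains by its middle column, and compares the condition number of the complete matrix with that of the Schur complement on the interface. All function names are hypothetical.

```python
import numpy as np

def laplacian_2d(nx, ny):
    """5-point Laplacian on an nx-by-ny grid of unknowns (Dirichlet boundary).
    Unknown (ix, iy) has index iy * nx + ix."""
    def lap1d(n):
        return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(ny), lap1d(nx)) + np.kron(lap1d(ny), np.eye(nx))

def interface_schur_complement(K, interface):
    """Eliminate all unknowns that are not on the interface."""
    interior = np.setdiff1d(np.arange(K.shape[0]), interface)
    K_ii = K[np.ix_(interior, interior)]
    K_ib = K[np.ix_(interior, interface)]
    K_bb = K[np.ix_(interface, interface)]
    return K_bb - K_ib.T @ np.linalg.solve(K_ii, K_ib)

nx, ny = 15, 15
K = laplacian_2d(nx, ny)
# The interface is the middle grid column; it separates two independent subdomains.
interface = np.array([iy * nx + nx // 2 for iy in range(ny)])
S = interface_schur_complement(K, interface)

cond_K = np.linalg.cond(K)   # condition number of the complete problem
cond_S = np.linalg.cond(S)   # condition number of the condensed interface problem
print(f"cond(K) = {cond_K:.1f}, cond(S) = {cond_S:.1f}")
```

For a symmetric positive definite matrix the inverse of the Schur complement is a diagonal block of the inverse of the complete matrix, so the condensed problem can never be worse conditioned than the complete one.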

These characteristics make these domain decomposition methods, when using direct local solvers, much more robust than standard iterative methods. They represent a good way to use direct solvers for problems with such large numbers of degrees of freedom that the solution of the complete problem by an LU factorization would not be affordable.

The tests presented in the previous section show that domain decomposition methods can, with a well suited splitting of the domain, be less expensive than direct solvers in both CPU time and memory requirements, even for very ill conditioned three-dimensional structural analysis problems.

Furthermore, these methods lead to a very high degree of parallelism, and they are very well suited to implementation on parallel systems with local memories, such as distributed memory or hierarchical memory machines.

An open question that needs to be solved to make such algorithms general purpose solvers lies in the problem of mesh splitting and interface localization.

Finding an optimal substructuring requires several different issues to be taken into consideration.

The subdomains must have a shape such that the local matrices have a low bandwidth, in order to make the use of direct local solvers efficient. That may lead to large interfaces. But the fewer points there are on the interface, the smaller the dimension of the dual problem is, which should help to ensure fast convergence of the outer conjugate gradient.

Furthermore, the condition numbers of the local problems and of the dual problem depend upon the aspect ratio of the substructures and of the interface.

To get round the local ill conditioning, the use of direct local solvers and of a reorthogonalization process for the outer conjugate gradient (see [15]) seems to be effective.

But the repercussions of the geometry of the decomposition on the condition number of the dual interface operator are difficult to anticipate, because they depend not only on the aspect ratio of the interface but also on the mechanical features of the global problem.

ABSTRACT

The success of highly parallel distributed memory multiprocessors will depend mainly on their efficiency when running realistic application codes. This paper concerns the adaptation to hypercube multiprocessors of a 3-D Navier-Stokes solver based on an ADI algorithm. Two solutions for the management of data transfer through the communication network are discussed. Performance results of the implementation on a 32-node iPSC2-SX are also presented.

Keywords: Navier-Stokes equations. ADI algorithms. Parallel computers. Hypercube. Distributed memory multiprocessors.

1. Introduction. Numerical simulation of three-dimensional time-dependent flows is a field of CFD which needs large computational power in order to yield realistic results. In this paper we study a parallel solver for the 3-D unsteady Navier-Stokes partial differential equations suited to distributed memory multiprocessors. Section 2 is devoted to the numerical method and the algorithm used to solve the N-S equations. The main part of the solver consists in the solution of the Poisson equation by a modified ADI algorithm. Section 3 discusses some mapping strategies on parallel processors and gives the performance obtained on the iPSC2-SX using a ring ADI algorithm. Section 4 describes a so-called hypercube ADI algorithm, which uses i-cycle index-digit permutations to move data across the processors. Finally, Section 5 compares the two approaches for different architectural parameters of the communication network.

2. The 3-D Navier-Stokes solver. For several years ONERA (1) and LIMSI (*) have collaborated on the development of 2-D and 3-D parallel Navier-Stokes solvers for shared and distributed memory multiprocessors. A 2-D solver for the numerical simulation of unsteady separated flows around an airfoil at high Reynolds numbers has been implemented on a shared memory multiprocessor [4]. A 3-D parallel solver is now under development [5]; it is based on the ADI method, as described in the following sections.

2.1. Navier-Stokes equations and numerical method. The unsteady 3-D Navier-Stokes equations for incompressible flows are written following a velocity-vorticity $(V - \omega)$ formulation [8]:

(2.1)  $\Delta V + \nabla \times \omega = 0$

(2.2)  $\dfrac{\partial \omega}{\partial t} + (V \cdot \nabla)\,\omega = (\omega \cdot \nabla)\,V + \nu\, \Delta \omega$

Equations 2.1 and 2.2 are approximated by a centred finite difference method. Then the alternating direction method of [3] is used for both the Poisson and the vorticity transport equations, so as to obtain a numerical solution with a precision of order 2 in space and 1 in time.

2.2. ADI algorithm for 3-D Poisson equation. The iterative algorithm chosen for the solution of equation 2.1 is a fractional step method with stabilization corrections [6], which is a generalization of the Douglas ADI method [7].

The equation:

(2.3)  $(L_x + L_y + L_z)\,U = \Phi$, with $L_x = \frac{\partial^2}{\partial x^2}$, $L_y = \frac{\partial^2}{\partial y^2}$, $L_z = \frac{\partial^2}{\partial z^2}$

is then solved by iterating the three steps of 2.4:

Iteration n:

(2.4)
$(L_x - 2\omega_{n,x} I)\,U^{n+1/3} = -(L_x + 2L_y + 2L_z + 2\omega_{n,x} I)\,U^n + 2\Phi$
$(L_y - 2\omega_{n,y} I)\,U^{n+2/3} = L_y U^n - 2\omega_{n,y} U^{n+1/3}$
$(L_z - 2\omega_{n,z} I)\,U^{n+1} = L_z U^n - 2\omega_{n,z} U^{n+2/3}$

where $\omega_{n,x}$, $\omega_{n,y}$, $\omega_{n,z}$ are acceleration parameters.

Considering data dependencies, each of these three steps is related to a linear recurrence equation in one of the space directions [10].
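
Each of the three steps of 2.4 reduces to independent tridiagonal systems along one grid direction, and Gaussian elimination on such a system is a pair of first-order linear recurrences. The following is a minimal sketch in Python rather than the solver's own code; the function name `solve_tridiagonal` and the list-based storage are illustrative assumptions.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    lower[i] multiplies x[i-1] (lower[0] is unused), diag[i] multiplies x[i],
    upper[i] multiplies x[i+1] (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    # Forward recurrence: step i depends only on step i - 1.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    # Backward recurrence: step i depends only on step i + 1.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Because both sweeps run sequentially along the line, a grid line that is split across processors forces communication inside the solve, which is why the data layout discussed next matters.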

¹ Office National d'Etudes et de Recherches Aérospatiales

* Laboratoire d'Informatique et de Mathématiques pour les Sciences de l'Ingénieur, BP 30, 91406 Orsay Cedex, France
3. Distributed memory multiprocessors and mapping strategies. Distributed memory architectures appear as a solution for the design of highly parallel multiprocessors. In such systems a set of processors (nodes) is connected in some fixed topology through a communication network. In contrast to shared memory systems, the allocation of data to the processors' local memories is one of the keys to exploiting the potential parallelism of an algorithm efficiently. A given data allocation may leave part of the processors idle. Hence the systematic analysis of data dependencies is an obligatory step in the design of codes well adapted to such highly parallel machines. Moreover, most of these machines do not at present hide the parallelism from the user: data flow between processors must be expressed explicitly by means of message-passing primitives.
Efficient algorithms for these multiprocessors are often the result of a trade-off between reducing the time spent on data communication in the network and reducing the computation time. Moreover, sending or receiving a message has a fixed cost that cannot be reduced, so some gain may be obtained by reducing the number of messages flowing through the network. The next sections discuss the influence of the mapping strategies on the data structure of the computational domain for the ADI algorithm.

3.1. Data structures for ADI algorithm. Considering the computational domain as a cube, the first question concerns its partitioning into equal subsets and the assignment of these subsets to the processors. Several solutions have already been studied. In [11,12] Saad presents solutions for the implementation of 2-D ADI algorithms on ring and grid networks. Saad also discusses alternatives to standard Gaussian elimination for the tridiagonal systems in ADI, such as substructuring and cyclic reduction. Results of implementations of the ring and substructuring solutions for 2-D algorithms are presented in [10], and estimated performances for the version of the 3-D Navier-Stokes solver using substructuring may be found in [4].

For the 3-D case the splitting of the cube into pencils is suited to the ring algorithm, see Figure 1, while a domain decomposition into sub-cubes is adapted to the substructuring algorithm on a 3-D processor grid, see Figure 2. The estimation of the performances of the substructuring algorithm will be presented in an extended version of this paper. Staying with the standard Gaussian algorithm for the solution of the three steps of 2.4, we discuss here an alternative to the ring algorithm which reduces the number of messages travelling in the network at the cost of an increase in the amount of data transferred.
Fig. 1. Data structure and assignment for the ring algorithm

Fig. 2. Data structure and assignment for a 3-D processor grid
Table 1

Ring ADI on iPSC2-SX

one time step of the 3-D N-S code (64^3)

number of proc.    16    32
time in sec.      102    63

3.2. Implementation of ring ADI algorithm on the iPSC2-SX. Considering the data structure of Figure 1, the ring algorithm has been implemented on the 32-node iPSC2-SX installed at ONERA. The Intel iPSC2-SX is a distributed memory multiprocessor which interconnects processors (i80386 microprocessor coupled to a Weitek 1167 coprocessor) in a hypercube topology network. In a P-processor hypercube network, each processor with number i = 0 ... P - 1 is directly connected to $\log_2(P)$ neighboring processors according to the binary representation of i [9].
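
The neighbor rule can be made concrete with a few lines of Python. This is only an illustrative sketch (the function name `hypercube_neighbors` is an assumption, not an iPSC2 system call): two nodes are neighbors when their binary numbers differ in exactly one bit.

```python
def hypercube_neighbors(i, P):
    """Return the log2(P) neighbours of node i in a P-node hypercube."""
    q = P.bit_length() - 1
    if P < 1 or 1 << q != P:
        raise ValueError("P must be a power of two")
    if not 0 <= i < P:
        raise ValueError("node number out of range")
    # Flipping bit d of i gives the neighbour along dimension d.
    return [i ^ (1 << d) for d in range(q)]
```

For the 32-node machine, node 5 (binary 00101) is connected to nodes 4, 7, 1, 13 and 21.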

Table 1 presents the elapsed times for one time step of the 3-D Navier-Stokes parallel solver obtained with 16 and 32 processors of the iPSC2-SX. For a 64^3 domain the corresponding elapsed time per mesh point and time step is 2.4 × 10^-4 sec.

4. Hypercube ADI algorithm. The 3-D ADI algorithm solves tridiagonal linear systems alternately in each of the 3 spatial directions, say 1, 2, 3. Since these systems can be solved independently, it is possible to transpose the data structure before each step of one ADI iteration so as to obtain, in each processor, local data storage suited to computation in direction 1, 2 or 3. This approach induces a clear separation between the communication phase and the computation phase. This data permutation, which can be done in $\log_2(P)$ steps using nearest-neighbor communications, may be considered as a generalisation of the matrix transpose algorithm presented in [13].

Let us consider a hypercube with $P = P_1 \times P_2 = 2^{q_1} \times 2^{q_2}$ processors ($q_1 \geq q_2$). We assume now, without any loss of generality, that the computational domain is a cube of dimension N, with $N = 2^l$ and $l > q_1$. Considering a splitting of the domain into $P_2 \times P_1 \times P_1$ blocks of size $\frac{N}{P_2} \times \frac{N}{P_1} \times \frac{N}{P_1}$, a block can be identified by the tuple $(i,j,k)$ with $i = 0 \ldots P_2 - 1$, $j = 0 \ldots P_1 - 1$, $k = 0 \ldots P_1 - 1$, while a processor is identified by $(l,m)$ with $l = 0 \ldots P_2 - 1$, $m = 0 \ldots P_1 - 1$.
Starting from the initial mapping suitable for parallel computation of the ADI step in direction 3, which is

MAP-3 :  $\forall k$, store block $(i,j,k)$ on processor $(i,j)$ at address $k$

we detail in the following how to modify this mapping in order to obtain a storage of the 3-D data structure suited to the ADI steps in direction 2 or 1.

4.1. Index-digit permutations for ADI algorithm. Our notation is a modification of the one used in [2,3], but the techniques are similar. Explaining the intermediate steps of the data movement in the hypercube requires the binary representation of the processor and block identifiers:

$B(i_{q_2} \ldots i_1 \mid j_{q_1} \ldots j_1 \mid k_{q_1} \ldots k_1)_2$ represents the block $B(i,j,k)_{10}$. Following the initial mapping, this block is stored at address $(k_{q_1} \ldots k_1)_2$ on the processor $(i_{q_2} \ldots i_1 \mid j_{q_1} \ldots j_1)_2$. The addresses of the different blocks of the data structure are made of two fields, one for the processor number and the other for the address in local memory. The permutation suitable for the ADI algorithm is an index-digit permutation. As demonstrated in [2], this kind of permutation can be implemented as a sequence of i-cycles³. It is interesting to look at the permutations or inter-processor communications that must be implemented on one specific processor $(i,j)_{10}$ in order to transpose the data structure.

Starting from the MAP-3 mapping we want to obtain the MAP-2 mapping (suitable for implicit computation in direction 2) by implementing i-cycles, where MAP-2 is:

$\forall k$, store block $(i,k,j)$ on processor $(i,j)$ at address $k$

The first operations necessary for this transposition are given in Table 2. Three different kinds of data movement are involved. The first permutation (labelled by $\xrightarrow{G}$) modifies the storage of the blocks in the local memory of the processor and gathers the blocks that will be exchanged. The second one (labelled by $\xrightarrow{E}$) corresponds to a communication between two neighboring processors. The third one (labelled by $\xrightarrow{S}$) scatters the blocks obtained from a neighbor processor to non-consecutive addresses in local memory.
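
To make the effect of these digit exchanges concrete, the sketch below simulates the transposition for a fixed first index i, on a q-cube of processors each holding $2^q$ blocks. It is an illustrative model, not the iPSC2 code: the function name `transpose_by_digit_exchange` is an assumption, and the gather, exchange and scatter movements are merged into one direct swap of digit d of the address with digit d of the processor number.

```python
def transpose_by_digit_exchange(q):
    """Simulate the transposition of a 2**q x 2**q array of blocks on a q-cube.

    memory[p][a] is the block held by processor p at local address a.
    Initially processor j holds block (j, k) at address k.
    """
    n = 1 << q
    memory = [[(j, k) for k in range(n)] for j in range(n)]
    messages = 0
    for d in range(q - 1, -1, -1):
        bit = 1 << d
        for p in range(n):
            if p & bit:
                continue  # each pair of neighbours is handled once
            partner = p | bit
            for a in range(n):
                if a & bit:  # address digit d differs from processor digit d
                    # Exchange digit d of the address with digit d of the
                    # processor number: the block moves to the neighbour.
                    b = a ^ bit
                    held = memory[p][a]
                    memory[p][a] = memory[partner][b]
                    memory[partner][b] = held
            messages += 2  # one message in each direction
    return memory, messages
```

After the q exchange steps, processor j holds block (k, j) at address k, which is the MAP-2 layout for that value of i, and each processor has sent exactly q messages.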

Though more complicated⁴, the transposition of the data structure so as to obtain a mapping for computation in direction 1 can be implemented using the same three i-cycles.

We  must  notice  that  each  i-cycle  operation  involves  blocks  and  that  the 
initial  mapping  MAP-3  is  obtained  after  any  even  number  of  ADI  iterations. 


³ An i-cycle is an index-digit permutation in which the most significant digit of the address is exchanged with any other digit, either in the address or the processor number.

⁴ from $B(i_{q_2} \ldots i_1 \mid j_{q_1} \ldots j_1 \mid k_{q_1} \ldots k_1)_2$ to $B(k_{q_1} \ldots k_{q_1-q_2+1} \mid i_{q_2} \ldots i_1 k_{q_1-q_2} \ldots k_1 \mid j_{q_1} \ldots j_1)_2$
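Table 2

From MAP-3 to MAP-2

$B(i_{q_2} \ldots i_1 \mid j_{q_1} j_{q_1-1} \ldots j_1 \mid k_{q_1} k_{q_1-1} \ldots k_1)_2$
  $\rightarrow$  $B(i_{q_2} \ldots i_1 \mid k_{q_1} j_{q_1-1} \ldots j_1 \mid j_{q_1} k_{q_1-1} \ldots k_1)_2$
  $\rightarrow$  ...
  $\rightarrow$  $B(i_{q_2} \ldots i_1 \mid k_{q_1} k_{q_1-1} \ldots j_1 \mid j_{q_1} j_{q_1-1} \ldots k_1)_2$
  $\rightarrow$  ...
  $\rightarrow$  $B(i_{q_2} \ldots i_1 \mid k_{q_1} \ldots k_1 \mid j_{q_1} \ldots j_1)_2$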

4.2. Implementation on iPSC2-SX. These ideas have been used for implementing a Poisson equation solver on the 5-cube iPSC2-SX. Table 3 presents the iPSC2 FORTRAN version of the three i-cycles G, E, S. From the information obtained when running this Poisson solver, the performance of the 3-D Navier-Stokes solver can be precisely evaluated. Moreover, since operations in each of the space directions of the computational domain are independent, the 32 nodes of the iPSC2 can be used to simulate a 1024-node system running the hypercube ADI algorithm. Table 4 gives the evolution of the elapsed time for computing one time step of the 3-D Navier-Stokes solver versus the number of processors (recall that the ring algorithm takes 63 sec. on a 32-processor iPSC2-SX).
Table 3

FORTRAN implementation of i-cycles for ADI on the iPSC2

C     perform transposition
      DO 1 M=1,PAS(GWDIR)
        TYPE=M+100*GWDIR+1000*COUNT
        IDEB=(1-MYBIT(TRANSDIR,M))*SMBLK+1
C     gather blocks
        CALL PACK(NMBLK,SMBLK,PENCIL(IDEB),BUF)
C     exchange blocks
        CALL CSEND(TYPE,BUF,SMES,NEXTPASS(TRANSDIR,M),1)
        CALL CRECV(TYPE,BUF(IWORK),SMES)
C     scatter blocks
        CALL UNPACK(NMBLK,SMBLK,PENCIL(IDEB),BUF(IWORK))
C     block size doubles and block count halves at each pass
        SMBLK=SMBLK*2
        NMBLK=NMBLK/2
    1 CONTINUE


Table 4

Hypercube ADI on iPSC2-SX (simulation results)

one time step of the 3-D N-S code (64^3)

number of proc.      8     16     32     64    128    256    512   1024
time in sec.     325.9    171   93.9     49   27.6  15.33    9.2    5.8
5. Influence of the communication network. The ring and hypercube algorithms differ only in the way the data structure is organized and in the management of the inter-processor communications. A simple model of the communication network will be used to evaluate the relative performance of these two algorithms.

Sending or receiving a message of length N between one processor and a neighbor takes a time of:

(5.1)  $\tau + N \times \frac{1}{V}$

where $\tau$ is the start-up time (in sec.) and V the communication bandwidth (in MB/sec.). Using this model the complexity of the communications involved in one time step of the two versions of the 3-D Navier-Stokes solver can be evaluated:

•  ring algorithm :  $4 \times (P - 1) \times (\tau + \ldots \times \frac{1}{V})$

•  hypercube algorithm :  $2 \times (q_1 + q_2) \times (\tau + \ldots \times \frac{1}{V})$
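
The model (5.1) and the two message counts above can be explored numerically. The sketch below is a hedged illustration: all function names are assumptions, and the per-message lengths of the two algorithms are not reproduced, so `length` is left as a free parameter to be supplied by the reader.

```python
def message_time(length, tau, bandwidth):
    """Time for one message of `length` units: start-up plus transfer."""
    return tau + length / bandwidth


def ring_message_count(P):
    """Messages per time step for the ring algorithm: 4 x (P - 1)."""
    return 4 * (P - 1)


def hypercube_message_count(q1, q2):
    """Messages per time step for the hypercube algorithm: 2 x (q1 + q2)."""
    return 2 * (q1 + q2)


def communication_time(count, length, tau, bandwidth):
    """Total time for `count` messages of equal `length`."""
    return count * message_time(length, tau, bandwidth)
```

On a 32-processor machine the ring algorithm pays 124 start-ups per time step against 10 for the hypercube algorithm, so a large start-up time favours the hypercube version even though its messages carry more data.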

Figures 3, 4, 5 and 6 show some comparisons of the communication costs of the two algorithms for different values of the start-up time and of the bandwidth V, assuming that the computational domain has dimension N = 64. Note that in this case the ring algorithm cannot use more than a 6-cube. For Figure 3 we take parameters representative of an iPSC2 system (assuming the possibility of interconnecting 4K processors). Figure 4 (resp. 5) shows estimates for an NCUBE system [4] (resp. an AMETEK-2010 system [5]). In Figure 6 we consider a "modified" version of the T800 Transputer [6] with up to twelve links.⁵
6. Conclusion. We have presented a solution for the management of data communications in a distributed memory multiprocessor which extends the use of Gaussian elimination for the ADI algorithm when the number of processors is greater than the dimension of the computational domain. The performances obtained on the iPSC2-SX and the estimates made for different parameters characterizing the communication network show that the efficiency of this solution depends on the ratio of start-up time to communication time. A comparison of this approach with the one based on substructuring techniques will be presented in a forthcoming paper.
Rayleigh Quotient Iteration as Newton's Method


J.  E.  Dennis  and  R.  A.  Tapia 

Rice  University 
Houston,  Texas 
U.S.A. 


Abstract 


The inverse, shifted inverse and Rayleigh quotient iterations are well-known algorithms for computing an eigenpair of a symmetric matrix. In this talk we establish that each one of these three algorithms can be viewed as a standard form of Newton's method from the constrained optimization literature. Our equivalence leads naturally to a new proof of the cubic convergence of Rayleigh Quotient Iteration.
Krylov subspace methods: theory, algorithms, and applications

Youcef Saad

Abstract

This paper gives an overview of projection methods based on Krylov subspaces for solving various types of scientific problems. The main idea of this class of methods, when applied to a linear system Ax = b, is to generate in some manner an approximate solution to the original problem from the so-called Krylov subspace $\mathrm{span}\{b, Ab, \ldots, A^{m-1}b\}$. Thus, the original problem of size N is approximated by one of dimension m, typically much smaller than N. Krylov subspace methods have been very successful in solving linear systems (Conjugate Gradients, GMRES, ...) and eigenvalue problems (Lanczos, Arnoldi, ...) and are now becoming popular for solving nonlinear equations. We will show some of the main ideas in Krylov subspace methods and discuss their use in solving linear systems, eigenvalue problems, parabolic partial differential equations, Lyapunov matrix equations, and nonlinear systems of equations. Some numerical experiments are presented to illustrate the concepts.


Key  words:  Krylov  subspace  methods,  Conjugate  Gradients,  Parabolic  equations,  nonlinear  Partial  Differ¬ 
1  Introduction


In  recent  years  Krylov  subspace  methods  have  become  a  useful  and  popular  tool  for  solving 
large  sets  of  linear  and  nonlinear  equations,  as  well  as  large  eigenvalue  problems.  One  of  the 
main  reasons  for  their  popularity  is  their  simplicity  and  generality.  When  dealing  with  large 
systems  of  equations,  projection  methods  based  on  Krylov  subspaces  are  often  found  to  be 
very  efficient  alternatives  to  the  traditional  approaches  that  are  based  on  direct  methods. 
This  trend  is  likely  to  accelerate  as  models  are  becoming  more  complex  and  give  rise  to 
larger  and  larger  matrices  for  which  direct  methods  become  prohibitively  expensive. 

Because  of  the  success  of  these  methods  in  solving  large  linear  systems  of  equations, 
much  recent  work  has  been  devoted  to  extending  their  applicability  to  solving  other  types  of 
problems  in  Scientific  Computing.  For  example,  there  has  been  substantial  progress  made 
in  using  these  methods  for  solving  the  nonlinear  equations  in  computational  fluid  dynamics 
[37,  22].  In  addition,  recent  work  has  shown  how  these  methods  can  be  used  to  solve 
equations  in  control  such  as  Lyapunov  equations  [32]  and  there  is  current  interest  in  solving 
time  dependent  partial  differential  equations  by  the  method  of  lines  [17]. 

The  purpose  of  this  paper  is  to  describe  the  general  concepts  used  in  Krylov  subspace 
methods  and  to  give  an  overview  of  the  different  ways  in  which  they  are  used.  As  will  be 
seen the method is fairly universal in that it can be used to provide approximate solutions to virtually any linear or nonlinear problem. However, it is clear that the actual success
of  the  method  will  depend  critically  on  the  nature  of  the  matrices  at  hand.  Thus,  conjugate 
gradient  type  methods  are  very  successful  for  symmetric  positive  definite  linear  systems  but 
have  been  rather  unsuccessful  with  highly  indefinite  problems. 

The  next  section  is  a  brief  introduction  to  Krylov  subspaces.  Section  3  discusses  the 
application  of  the  method  to  linear  systems,  and  Section  4  is  on  eigenvalue  problems.  Section 
5  will  be  on  evaluating  the  product  of  the  exponential  of  a  matrix  A  times  a  vector  with 
some  applications.  Finally,  Section  6  will  discuss  the  use  of  Krylov  subspace  methods  for 
solving  nonlinear  problems. 

2  Krylov  subspaces 

Given a square matrix A and a nonzero vector v, the subspace defined by

$K_m = \mathrm{span}\{v, Av, A^2 v, \ldots, A^{m-1} v\}$  (1)

is referred to as the m-th Krylov subspace associated with the pair (A, v) and is denoted by $K_m(A,v)$, or simply by $K_m$ if there is no ambiguity. We start by stating a few elementary properties of Krylov subspaces. Recall that the minimal polynomial of a vector v is the nonzero monic polynomial p of lowest degree such that p(A)v = 0. Clearly, the Krylov subspace $K_m$ is the subspace of all vectors in $\mathbb{C}^N$ which can be written as x = p(A)v, where p is a polynomial of degree not exceeding m - 1.

Proposition 2.1 The Krylov subspace $K_m$ is of dimension m if and only if the degree of the minimal polynomial of v with respect to A is not less than m.
In practice it is rather uncommon for the degree of the minimal polynomial to be less than N, even in exact arithmetic. If this does happen, it is usually helpful rather than harmful because of the following proposition.

Proposition 2.2 Let $\mu$ be the degree of the minimal polynomial of v. Then $K_\mu$ is invariant under A and $K_m = K_\mu$ for all $m \geq \mu$.

Thus, if $\mu$ is small we can work in a subspace of dimension $\mu$ and solve the problem exactly in this small subspace.

Working directly with the basis $\{A^j v\}_{j=0,\ldots,m-1}$ is likely to lead to serious numerical difficulties. Most Krylov subspace methods utilize either orthogonal or bi-orthogonal bases of $K_m$. Thus, the procedure introduced by Arnoldi [1] builds an orthogonal basis of the Krylov subspace $K_m$ by the following algorithm.

Arnoldi's algorithm:

1. Start: Choose a vector $v_1$ of norm 1.

2. Iterate: for j = 1, 2, ..., m compute,

$h_{ij} = (A v_j, v_i), \quad i = 1, 2, \ldots, j$  (2)

$w = A v_j - \sum_{i=1}^{j} h_{ij} v_i$  (3)

$h_{j+1,j} = \|w\|_2$  (4)

$v_{j+1} = w / h_{j+1,j}$  (5)

This algorithm is mathematically equivalent to a Gram-Schmidt process applied to the power sequence $v_1, Av_1, \ldots, A^{m-1}v_1$, in that it would deliver the same sequence of $v_i$'s in exact arithmetic. The algorithm will stop if the vector w computed in (3) vanishes, which happens if the degree of the minimal polynomial for $v_1$ is j. This is referred to as a 'lucky' breakdown since, as was seen above, it means that the original problem (linear system, eigenvalue problem) can be solved exactly in a j-dimensional subspace.
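
Steps (2) to (5) translate almost line by line into code. The following pure-Python sketch is for illustration only: the names `dot`, `matvec` and `arnoldi`, the list-of-rows storage and the breakdown tolerance `tol` are assumptions. It uses the classical Gram-Schmidt form written above; in floating-point arithmetic the modified Gram-Schmidt variant is generally preferred.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def arnoldi(A, v, m, tol=1e-12):
    """Run at most m Arnoldi steps on the real matrix A (list of rows).

    Returns (V, H). Without breakdown V holds m + 1 orthonormal vectors and
    H is the (m + 1) x m Hessenberg array. After a breakdown at step k,
    V holds k vectors and H is trimmed to (k + 1) x k.
    """
    norm_v = math.sqrt(dot(v, v))
    V = [[a / norm_v for a in v]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        Av = matvec(A, V[j])
        # (2): h_ij = (A v_j, v_i) for i = 1, ..., j
        for i in range(j + 1):
            H[i][j] = dot(Av, V[i])
        # (3): w = A v_j - sum_i h_ij v_i
        w = list(Av)
        for i in range(j + 1):
            w = [a - H[i][j] * b for a, b in zip(w, V[i])]
        # (4): h_{j+1,j} = ||w||_2
        H[j + 1][j] = math.sqrt(dot(w, w))
        if H[j + 1][j] <= tol:
            k = j + 1  # 'lucky' breakdown: K_k is invariant under A
            return V, [row[:k] for row in H[:k + 1]]
        # (5): v_{j+1} = w / h_{j+1,j}
        V.append([a / H[j + 1][j] for a in w])
    return V, H
```

For example, with A = diag(1, 2, 3, 4) and v = (1, 1, 0, 0) the minimal polynomial of v has degree 2, and the routine stops after two steps with two basis vectors.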

The  following  are  a  few  simple  but  important  properties  satisfied  by  the  algorithm. 

Proposition 2.3 The vectors $v_1, v_2, \ldots, v_m$ form an orthonormal basis of the subspace $K_m = \mathrm{span}\{v_1, Av_1, \ldots, A^{m-1}v_1\}$.

Proposition 2.4 Denote by $V_m$ the N × m matrix with column vectors $v_1, \ldots, v_m$ and by $H_m$ the m × m Hessenberg matrix whose nonzero entries are defined by the algorithm. Then the following relations hold:

$A V_m = V_m H_m + h_{m+1,m} v_{m+1} e_m^T$  (6)

$V_m^T A V_m = H_m$  (7)
Note that when A is symmetric, (7) implies that the matrix $H_m$ is symmetric tridiagonal and, as a result, Arnoldi's algorithm simplifies into an algorithm which involves only three consecutive vectors at each step. The corresponding algorithm is the well-known Lanczos algorithm.
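
The simplification can be seen in a short sketch of the symmetric case. This is an illustrative pure-Python version, not code from the paper: the names `dot` and `lanczos` and the returned triple are assumptions, and no reorthogonalization is attempted.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def lanczos(A, v, m):
    """Run m Lanczos steps on the symmetric matrix A (list of rows).

    Returns (alpha, beta, V): alpha holds the diagonal of the tridiagonal
    matrix, beta its off-diagonal, V the Lanczos vectors.
    """
    norm_v = math.sqrt(dot(v, v))
    v_curr = [a / norm_v for a in v]
    v_prev = [0.0] * len(v)
    alpha, beta, V = [], [], [v_curr]
    b = 0.0
    for j in range(m):
        w = [dot(row, v_curr) for row in A]
        a = dot(w, v_curr)
        # Only the two most recent vectors enter the recurrence.
        w = [wi - a * vc - b * vp for wi, vc, vp in zip(w, v_curr, v_prev)]
        alpha.append(a)
        b = math.sqrt(dot(w, w))
        if j == m - 1 or b == 0.0:
            break
        beta.append(b)
        v_prev, v_curr = v_curr, [wi / b for wi in w]
        V.append(v_curr)
    return alpha, beta, V
```

Compared with the general Arnoldi loop, the inner sum over all previous vectors has disappeared: each new vector is built from the current one and its predecessor only.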

The second of the relations in the proposition indicates that the Hessenberg matrix $H_m$ is nothing but the matrix representation of the projection of A onto $K_m$, with respect to the orthogonal basis $V_m$. Analysis of various projection methods based on Krylov subspaces indicates that, loosely speaking, $K_m$ contains the most significant information of A, in that the outermost eigenvalues of A are well represented by those of its projection onto $K_m$, for large enough m. The main idea of Krylov subspace methods is to project the original problem into $K_m$. In the next sections we will see how this is done via simple Galerkin type procedures, for standard linear algebra problems. Then in the following sections we will address other types of problems.

The  relation  (6)  has  been  exploited  in  [13]  for  solving  special  Sylvester’s  equations  that 
arise  in  the  design  of  reduced-dimensional  state  estimator.  The  Arnoldi  and  block-Arnoldi 
algorithms  have  been  used  in  [5]  to  compute  numerically  the  controllability  of  a  linear  system. 


3  Krylov  subspace  methods  for  solving  linear  systems 

Given an initial guess $x_0$ to the linear system

$Ax = b$,  (8)

a general projection method seeks an approximate solution $x_m$ from an affine subspace $x_0 + K_m$ of dimension m by imposing the Petrov-Galerkin condition

$b - A x_m \perp L_m$  (9)

where $L_m$ is another subspace of dimension m. A Krylov subspace method is a method for which the subspace $K_m$ is the Krylov subspace

$K_m(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{m-1} r_0\}$,  (10)

in which $r_0 = b - Ax_0$. The different versions of Krylov subspace methods arise from different choices of the subspaces $K_m$ and $L_m$ and from the ways in which the system is preconditioned. The most common choices of $K_m$ and $L_m$ are the following.

1. $L_m = K_m = K_m(A, r_0)$. The conjugate gradient method is a particular instance of this case when the matrix is symmetric positive definite. Another method in this class is the Full Orthogonalization Method (FOM) [29], which is closely related to Arnoldi's method for solving eigenvalue problems [1]. Also in this class is ORTHORES [19], a method that is mathematically equivalent to FOM. Axelsson [2] also derived a similar algorithm for general nonsymmetric matrices.

As an example we outline here the FOM method for solving linear systems. Assume that we take $v_1 = r_0/\|r_0\|_2$ and run m steps of Arnoldi's method described in the previous section. Then, the approximate solution is of the form $x_0 + V_m y_m$ where $y_m$ is some m-vector. The Galerkin condition (9) with $L_m = K_m$ gives immediately that $y_m = H_m^{-1} \|r_0\|_2 e_1$.
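
The step from the Galerkin condition to the expression for the coefficient vector can be spelled out as follows. This is a short derivation sketch using relation (7) and the orthonormality of the Arnoldi vectors; the abbreviation $\beta = \|r_0\|_2$ is introduced here for brevity, and $H_m$ is assumed to be nonsingular.

```latex
\begin{align*}
r_0 &= \beta v_1 = \beta V_m e_1, \qquad \beta = \|r_0\|_2, \\
b - A(x_0 + V_m y_m) &= r_0 - A V_m y_m, \\
V_m^T \left( r_0 - A V_m y_m \right) &= \beta e_1 - H_m y_m = 0, \\
y_m &= H_m^{-1} \beta e_1 .
\end{align*}
```

The large N-dimensional problem is thus replaced by an m × m Hessenberg system.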

2. $L_m = AK_m$; $K_m = K_m(A, r_0)$. With this choice of $L_m$, it can be shown, see e.g. [33], that the approximate solution $x_m$ minimizes the residual norm $\|b - Ax\|_2$ over all candidate vectors in $x_0 + K_m$. In contrast, there is no similar optimality property known for methods of the first class when A is nonsymmetric. Because of this, many methods of this type have been derived for the nonsymmetric case [3, 19, 15, 34]. The Conjugate Residual method [10] is the analogue of the conjugate gradient method in this class. The GMRES algorithm [34] is an extension of the Conjugate Residual method to nonsymmetric problems.

3. $L_m = K_m(A^T, r_0)$; $K_m = K_m(A, r_0)$. Clearly, in the symmetric case this class of methods reduces to the first one. In the nonsymmetric case, the biconjugate gradient method (BCG) due to Lanczos [21] and Fletcher [16] is a good representative of this class. There are various mathematically equivalent formulations of the biconjugate gradient method [30], some of which are more numerically viable than others. An efficient variation on this method, called CGS (Conjugate Gradient Squared), was proposed by Sonneveld [35].

Apart from the above three basic methods there are a number of techniques for nonsymmetric problems that are mathematically equivalent to solving the normal equations $A^T A x = A^T b$ or $A A^T y = b$ by the conjugate gradient method. We note that these methods are often too quickly dismissed as inferior because the condition number of the original problem is squared. For problems that are strongly indefinite they do represent, however, the only viable alternative, since none of the above three types of methods would work in this situation.

An important factor in the success of conjugate gradient-like methods is the preconditioning technique. This typically consists of replacing the original linear system (8) by, for example, the equivalent system

$M^{-1} A x = M^{-1} b$  (11)

In the classical case of the incomplete LU preconditionings, the matrix M is of the form M = LU where L is a lower triangular matrix and U is an upper triangular matrix such that L and U have the same structure as the lower and upper triangular parts of A respectively. In the general sparse case, the incomplete factorization is obtained by performing the standard LU factorization of A and dropping all fill-in elements that are generated during the process. This is referred to as ILU(0), or IC(0) in the symmetric case.
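
The dropping rule can be illustrated on a small dense array. The sketch below is a hedged illustration, not a production preconditioner: the name `ilu0` and the dense list-of-rows storage are assumptions, no pivoting is performed, and a real implementation would work on a sparse data structure.

```python
def ilu0(A):
    """Incomplete LU factorization with zero fill-in of a square matrix.

    A is a list of rows. Returns (L, U) with L unit lower triangular and
    U upper triangular; entries outside the nonzero pattern of A stay zero.
    """
    n = len(A)
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    W = [list(map(float, row)) for row in A]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            W[i][k] /= W[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:  # drop any fill-in outside the pattern
                    W[i][j] -= W[i][k] * W[k][j]
    L = [[W[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[W[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U
```

For a tridiagonal matrix no fill-in occurs, so the incomplete factors coincide with the exact LU factors; for an "arrow" matrix the entries that exact elimination would create at positions (2,3) and (3,2) are simply left at zero.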

4  Krylov  subspace  methods  for  eigenvalue  problems 

An idea that is basic to sparse eigenvalue calculations is that of projection processes [31]. Given a subspace K spanned by a system of m orthonormal vectors $V = [v_1, \ldots, v_m]$, a projection process onto K = span{V} computes an approximate eigenpair $\lambda \in \mathbb{C}$, $u \in K$ that satisfies the Galerkin condition,

$(A - \lambda I)u \perp K$  (12)

The approximate eigenvalues $\lambda_i$ are the eigenvalues of the m × m matrix $C = V^T A V$. The corresponding approximate eigenvectors are the vectors $u_i = V y_i$, where $y_i$ are the eigenvectors of C. Similarly, the approximate Schur vectors are the vector columns of VU, where $U = [u_1, u_2, \ldots, u_m]$ are the Schur vectors of C, i.e., $U^T C U$ is quasi-upper triangular. Thus, one possible method for computing eigenvalues/eigenvectors of large sparse matrices is to use the Arnoldi process [1, 28], which is a projection process onto $K_m = \mathrm{span}\{v_1, Av_1, \ldots, A^{m-1}v_1\}$. Once the Arnoldi vectors have been generated we can use $V_m$ for a projection process onto $K_m$. The matrix $V_m^T A V_m$ which is needed for this purpose is nothing but the upper Hessenberg matrix $H_m$ generated by the algorithm.

Note that the Arnoldi algorithm utilizes the matrix A only to compute successive matrix by vector products w = Av, so sparsity can be exploited. As m increases, the eigenvalues of $H_m$ that are located in the outermost part of the spectrum start converging towards corresponding eigenvalues of A. However, the difficulty with the above algorithm is that as m increases, cost and storage increase rapidly. One solution is to use the method iteratively: m is fixed and the initial vector $v_1$ is taken at each new iteration as a linear combination of some of the approximate eigenvectors. Moreover, there are several ways of accelerating convergence by preprocessing $v_1$ by a Chebyshev iteration before restarting, i.e., by taking $v_1 = t_k(A)z$ where z is again a linear combination of eigenvectors.

A technique related to Arnoldi's method is the nonsymmetric Lanczos algorithm [24, 12], which produces a nonsymmetric tridiagonal matrix instead of a Hessenberg matrix. Unlike Arnoldi's process, this method requires multiplications by both A and $A^T$ at every step. On the other hand it has the big advantage of requiring little storage (5 vectors). Although no comparisons of the performances of the Lanczos and the Arnoldi type algorithms have been made, the Lanczos methods are usually recommended whenever the number of eigenvalues to be computed is large.

Finally, if the matrix is banded, an efficient solution is the shift and invert strategy, which
consists of using one of the above iterative methods (subspace iteration, Arnoldi, or Lanczos)
for the matrix $(A - \sigma I)^{-1}$, where $\sigma$ is some shift chosen, say, at the center of some
small region of the complex plane where eigenvalues are sought. The matrix $(A - \sigma I)^{-1}$
need not be explicitly computed: all we need is to factor $(A - \sigma I)$ into LU and subsequently,
at each step of the iterative method, solve two triangular systems, one with L and the other with U.
Thus band structure can be fully exploited. In [25] several implementations of the shift and
invert strategy are considered and the problem of avoiding complex arithmetic when A is
real is addressed.
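As an illustration of the factor-once, solve-many pattern, here is a minimal Python sketch. It
assumes a real symmetric tridiagonal matrix, uses the simplest iterative method (the power
method) on $(A - \sigma I)^{-1}$, and factors without pivoting, which is only safe when no pivot
vanishes. The function names are hypothetical and are not taken from [25].

```python
import numpy as np

def factor_tridiagonal(lower, diag, upper):
    """LU-factor a tridiagonal matrix once, without pivoting.

    Assumes every pivot is nonzero (not guaranteed for an arbitrary shift).
    Returns the multipliers of L and the diagonal (pivots) of U.
    """
    pivots = np.array(diag, dtype=float)
    mult = np.zeros(len(pivots) - 1)
    for i in range(1, len(pivots)):
        mult[i - 1] = lower[i - 1] / pivots[i - 1]
        pivots[i] -= mult[i - 1] * upper[i - 1]
    return mult, pivots

def solve_factored(mult, pivots, upper, rhs):
    """Solve L U x = rhs with one forward and one backward sweep."""
    x = np.array(rhs, dtype=float)
    for i in range(1, len(x)):                # forward sweep: L y = rhs
        x[i] -= mult[i - 1] * x[i - 1]
    x[-1] /= pivots[-1]
    for i in range(len(x) - 2, -1, -1):       # backward sweep: U x = y
        x[i] = (x[i] - upper[i] * x[i + 1]) / pivots[i]
    return x

def shift_invert_eigenvalue(lower, diag, upper, sigma, iters=200, seed=0):
    """Power method on (A - sigma I)^{-1}; returns the eigenvalue of A nearest sigma."""
    mult, pivots = factor_tridiagonal(lower, diag - sigma, upper)  # factor once
    v = np.random.default_rng(seed).standard_normal(len(diag))
    v /= np.linalg.norm(v)
    mu = 0.0
    for _ in range(iters):
        w = solve_factored(mult, pivots, upper, v)   # two cheap banded solves
        mu = v @ w            # Rayleigh quotient, approximates 1 / (lambda - sigma)
        v = w / np.linalg.norm(w)
    return sigma + 1.0 / mu

# Example: the matrix tridiag(-1, 2, -1) of order 10 and the shift sigma = 1.1.
n = 10
diag = 2.0 * np.ones(n)
off = -np.ones(n - 1)
lam = shift_invert_eigenvalue(off, diag, off, sigma=1.1)
```

For this test matrix the eigenvalues are known in closed form, $2 - 2\cos(k\pi/(n+1))$, so the
result can be checked against the eigenvalue closest to the shift.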

5  Approximation to $e^A v$ and applications

Computing approximations to the exponential of a matrix is usually not too hard a problem
for small dense matrices. For large matrices, this can become a rather challenging task
because $e^A$ is, in general, a dense matrix even when A is very sparse. Fortunately,
in realistic applications it is often not the exponential of the matrix that is sought but rather
the product of this exponential with some vector v. The question of approximating $e^A v$ for
any given vector v was considered in [17], where polynomial and rational approximations to
the exponential were used. Here we summarize only the method proposed in [17] that is
based on polynomial approximation to $e^A v$. The desired approximation to $e^A v$ is expressed
in the form

$$e^A v \approx p_{m-1}(A)\, v \qquad (13)$$

where $p_{m-1}$ is a polynomial of degree $m - 1$. Thus, the vector on the right-hand side of (13)
is an element of the Krylov subspace (1) and it is convenient to express it in the orthonormal
basis $V_m = [v_1, v_2, \ldots, v_m]$ generated by Arnoldi's algorithm seen earlier. Therefore we
will write the desired approximation $x_m = p_{m-1}(A)v$ as $x_m = V_m y$ where y is an m-vector.
There remains to choose the unknown y. In [17], the choice $y = \beta e^{H_m} e_1$ with
$\beta = \|v\|_2$ was suggested, leading to the following formula,

$$e^A v \approx \beta V_m e^{H_m} e_1 \qquad (14)$$

The  quality  of  this  approximation  was  also  analyzed  in  [17]  and  the  following  result  was 
shown. 

Theorem 5.1  Let A be any square matrix and let $\rho = \|A\|_2$. Then the error of the
approximation (14) is such that

$$\|e^A v - \beta V_m e^{H_m} e_1\|_2 \le 2\beta\, \frac{\rho^m e^{\rho}}{m!}. \qquad (15)$$

Experiments reported in [17] reveal that the approximation (14) can be quite accurate even
for moderate values of the degree m. The theorem shows convergence of this approximation
as m increases to infinity, but the bound (15) is not sharp in general. Note also that the
above approximation is exact when m = N, see [17].
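The formula (14) can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the helper names (`arnoldi`, `expm_small`, `krylov_expv`) are hypothetical,
the small exponential $e^{H_m}$ is computed by a plain scaling-and-squaring Taylor series rather
than by the rational approximations discussed in [17], and the test matrix is an arbitrary
symmetric one chosen so that a reference value is easy to obtain.

```python
import math
import numpy as np

def arnoldi(A, v, m):
    """m steps of Arnoldi; returns V_m, H_m and beta = ||v||_2 (assumes no breakdown)."""
    n = len(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    V[:, 0] = v / beta
    for j in range(m):
        w = A @ V[:, j]                    # the only place where A is used
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m], beta

def expm_small(M, terms=20):
    """Dense exponential of a small matrix: scaling and squaring with a Taylor series."""
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    X = M / 2.0**s
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ X / k
        E = E + term
    for _ in range(s):                     # undo the scaling by repeated squaring
        E = E @ E
    return E

def krylov_expv(A, v, m):
    """Approximation (14): exp(A) v ~ beta * V_m * exp(H_m) * e_1."""
    V, H, beta = arnoldi(A, v, m)
    return beta * (V @ expm_small(H)[:, 0])   # exp(H_m) e_1 is the first column

n, m = 50, 20
A = -(2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))   # symmetric test matrix
v = np.random.default_rng(0).standard_normal(n)
lam, Q = np.linalg.eigh(A)
exact = Q @ (np.exp(lam) * (Q.T @ v))      # reference value of exp(A) v
err = np.linalg.norm(krylov_expv(A, v, m) - exact)
rho = np.linalg.norm(A, 2)
bound = 2.0 * np.linalg.norm(v) * rho**m * math.exp(rho) / math.factorial(m)
```

Comparing `err` with `bound` gives a feel for how pessimistic (15) can be on a particular example.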

To illustrate the concepts described in this section we now describe two applications.
The first is in solving parabolic equations and the second in handling large Lyapunov matrix
equations.


5.1  Application  1:  parabolic  equations 

One application of the above formulas is that one can approximate $e^{tA}v$ for all t as

$$e^{tA} v \approx \beta V_m e^{tH_m} e_1. \qquad (16)$$

This provides a way of solving the model homogeneous ordinary differential equations
$\dot w = -Aw$, whose solution is

$$w(t) = e^{-tA} w_0 \qquad (17)$$

in which $w_0$ is the initial condition.

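We now consider the following linear parabolic partial differential equation:

$$\frac{\partial u(x,t)}{\partial t} = \mathcal{L} u(x,t) + s(x), \quad x \in \Omega, \qquad (18)$$
$$u(0,x) = u_0, \quad \forall x \in \Omega,$$
$$u(t,x) = \sigma(x), \quad x \in \partial\Omega, \ t > 0.$$

Discretizing (18) with respect to the space variables leads to a system of ordinary
differential equations of the form

$$\dot w = -Aw + f, \quad w(0) = w_0, \qquad (19)$$

whose solution is explicitly given by

$$w(t) = A^{-1} f + e^{-tA}(w_0 - A^{-1} f) \qquad (20)$$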

which simplifies to (17) in the case of a homogeneous system (f = 0). Note that if we denote
$\hat w(t) = w(t) - A^{-1} f$ and, accordingly, $\hat w_0 = w_0 - A^{-1} f$, then $\hat w(t)$
satisfies a homogeneous system and we have

$$\hat w(t) = e^{-tA} \hat w_0. \qquad (21)$$

It is therefore possible to obtain the solution at time t in one single step, if desired, by
applying the matrices $A^{-1}$ and $e^{-tA}$ to certain vectors.

If, instead of attempting to compute the solution at time t in one single step, we use a
time-stepping procedure, then we will write at a given step t,

$$\hat w(t + \delta) = e^{-\delta A} \hat w(t). \qquad (22)$$

To be able to use (22) in a numerical procedure, we need to be able to compute a vector of
the form $\exp(-\delta A)v$ at every step of the procedure. The techniques outlined above can be
used for this purpose. An important point to make here is that the corresponding procedure
derived from (14) is essentially an explicit scheme, because it only requires evaluating matrix
by vector products.

We would like now to discuss some of the aspects of the method based on the above
approach. First, as was just pointed out, the method is explicit in nature, since it does
not require any solution of linear systems with the matrix A. The question that may be
raised here concerns its stability. In fact, there is no stability difficulty in the usual sense
of ODE methods because of the very nature of the scheme. Indeed, assuming that at
every step the approximation to the exact solution $\hat w(t)$ at equidistant intervals is defined as
$\hat w_{k+1} = e^{-\delta A} \hat w_k + \epsilon_k$, where $\epsilon_k$ is the error introduced in the
approximation of the exponential term (including arithmetic rounding), we see immediately by
comparing with the formula (21) for the exact solution that the error $e_k$ at each step satisfies:

$$e_{k+1} = e^{-\delta A} e_k + \epsilon_k. \qquad (23)$$

In other words the scheme is unconditionally stable, in presence of a positive definite operator
A.
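The step from (23) to unconditional stability can be spelled out as follows. This is a sketch
under the additional assumptions that A is symmetric positive definite with smallest eigenvalue
$\lambda_{\min} > 0$ and that the initial error $e_0$ is zero.

```latex
% Unrolling the recurrence (23) with e_0 = 0:
e_k = \sum_{j=0}^{k-1} e^{-(k-1-j)\,\delta A}\, \epsilon_j .
% For symmetric positive definite A, \|e^{-\delta A}\|_2 = e^{-\delta \lambda_{\min}} < 1
% for every step size \delta > 0, hence
\|e_k\|_2 \;\le\; \sum_{j=0}^{k-1} e^{-(k-1-j)\,\delta \lambda_{\min}} \, \|\epsilon_j\|_2
          \;\le\; \frac{\max_j \|\epsilon_j\|_2}{1 - e^{-\delta \lambda_{\min}}} .
```

The local errors are therefore damped rather than amplified, whatever the value of $\delta$; the
step size only affects how large each local error $\epsilon_k$ is.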


The above stability property has been extended to a similar scheme proposed in [18] for
the case where f may depend on time and A is independent of time. Note, however, that
for this more realistic case one must be careful concerning the scheme used, as there are
schemes whose stability does depend on the stepsize.

This may seem to be in contradiction with the conventional wisdom that explicit
schemes require very small time steps for stability to be guaranteed. The reason why
this does not apply here is that the argument given above is limited to very specific problems,
namely problems with constant coefficients. In fact the exact solution can be computed in
one step provided a high enough order approximation to the exponential is used!

The key point here is the possibility of using high order approximations. The importance
of using high order schemes both in explicit and implicit methods has been emphasized in
a few recent papers, see e.g., [36], [27]. We report here an experiment from [18] to further
illustrate this point. The experiment was performed on a Cray Y-MP/832.

We consider a three-dimensional problem of the form

$$u_t = u_{xx} + u_{yy} + u_{zz}, \quad x, y, z \in (0,1),$$
$$u = 0 \ \text{on the boundary},$$

which is discretized using 17 grid points in each direction. This yields a matrix of size
$N = 15^3 = 3375$. The initial conditions are chosen once the operator is discretized, in such a
way that the solution is known for all t. We take

$$u(0, x_i, y_j, z_k) = \sum_{i',j',k'=1}^{n} \frac{1}{i' + j' + k'}
  \sin\frac{i i' \pi}{n+1} \, \sin\frac{j j' \pi}{n+1} \, \sin\frac{k k' \pi}{n+1}.$$

The above expression is simply an explicit linear combination of the eigenvectors of the
discretized operator.

The goal is to integrate this partial differential equation between t = 0 and t = 0.1, and
achieve an error-norm at t = 0.1 which is less than $\epsilon = 10^{-10}$. Here by error-norm we
mean the 2-norm of the absolute error. Both the dimension m of the Krylov subspace and
the time-step $\Delta t$ can be varied. Normally, we would first choose a degree m and then try
to determine the maximum $\Delta t$ allowed to achieve the desirable error level. However, for
convenience we have proceeded in the opposite manner: we first select a step-size $\Delta t$ and
then determine the minimum m that is needed to achieve the desirable error level.

What is shown in Table 1 is the various time steps chosen (column 1) and the minimum
values of m (column 2) to achieve an absolute error less than $\epsilon = 10^{-10}$ at t = 0.1. We
show in the third column the total number of matrix-by-vector multiplications required to
complete the integration. The times required to complete the integration on a Cray Y-MP
are shown in the fourth column and the final 2-norm of the error achieved is shown in the fifth
column. The vector $e^{-H_m} e_1$ was computed using Pade or Chebyshev rational approximation
to the exponential. The exact type of approximation used in each case is indicated in the
last column of the table, with P(k,k) meaning Pade of type (k,k) and C(k,k) meaning
Chebyshev of the type (k,k). Here the Chebyshev approximation corresponds to the best
uniform approximation to $e^{-x}$ on the positive real axis as described in [11].

    Δt            m     M-vec's       Time (sec)    ||Error||_2   Method
    0.5000E-04    6     [illegible]   0.8173E+01    0.1957E-11    P(2,2)
    0.1000E-03    7     [illegible]   0.4793E+01    0.3308E-10    P(2,2)
    0.5000E-03    10    [illegible]   0.1342E+01    0.1800E-10    P(4,4)
    0.1000E-02    12    [illegible]   0.7983E+00    0.2260E-10    P(4,4)
    0.5000E-02    20    [illegible]   0.2672E+00    0.5271E-10    P(8,8)
    0.1000E-01    26    260           0.1740E+00    0.7247E-10    P(8,8)
    0.2000E-01    34    170           0.1080E+00    0.3236E-10    C(14,14)
    0.3000E-01    39    156           0.9876E-01    0.6362E-10    C(14,14)
    0.4000E-01    44    132           0.8030E-01    0.4122E-10    C(14,14)
    0.5000E-01    49    98            0.5932E-01    0.5791E-10    C(14,14)
    0.1000E+00    71    71            0.4186E-01    0.9993E-10    C(14,14)

Table  1:  Performance  of  the  polynomial  scheme  with  varying  accuracy  on  the  Cray  YMP. 
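The legible entries of the third column of Table 1 are consistent with a simple count: the
number of time steps needed to reach t = 0.1, times the Krylov dimension m. The short Python
check below reproduces those entries; the counting rule is an observation about the table, not
a formula stated in the text, and the names used are illustrative.

```python
import math

T_FINAL = 0.1

# (time step, Krylov dimension m, reported M-vec's) for the rows of Table 1
# whose third column is legible.
rows = [(0.01, 26, 260), (0.02, 34, 170), (0.03, 39, 156),
        (0.04, 44, 132), (0.05, 49, 98), (0.1, 71, 71)]

def matvec_count(dt, m, t_final=T_FINAL):
    """Number of steps needed to reach t_final, times m products per step."""
    steps = math.ceil(t_final / dt - 1e-9)   # small guard against round-off
    return steps * m

counts = [matvec_count(dt, m) for dt, m, _ in rows]
```

The trade-off is visible directly: raising m from 26 to 71 multiplies the work per step by less
than three while dividing the number of steps by ten.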

Since the matrix is symmetric, we have used a Lanczos algorithm to generate the $v_i$'s
instead of the full Arnoldi algorithm. No reorthogonalization of any sort was performed. The
matrix consists of 7 diagonals, so the matrix by vector products are performed by diagonals,
resulting in a very effective use of the vector capabilities of the Y-MP. It was estimated that
the average Mflops rate reached (excluding the calculation of the exponential of the small
matrix) was around 220. This is achieved with virtually no code optimization.
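A matrix by vector product "by diagonals" can be sketched as follows in Python. The storage
convention (one full-length array per diagonal, with `d[i]` holding the entry in row `i`) and
the function name are assumptions made for this illustration; each diagonal contributes one
long vector operation, which is what makes the kernel attractive on vector hardware.

```python
import numpy as np

def matvec_by_diagonals(offsets, diagonals, x):
    """Compute y = A x where A is stored by diagonals.

    diagonals[k][i] holds A[i, i + offsets[k]]; out-of-range entries are ignored.
    """
    n = len(x)
    y = np.zeros(n)
    for off, d in zip(offsets, diagonals):
        if off >= 0:
            y[:n - off] += d[:n - off] * x[off:]     # rows 0 .. n-off-1
        else:
            y[-off:] += d[-off:] * x[:n + off]       # rows -off .. n-1
    return y

# Small check on a 7-diagonal matrix: the 3D discrete Laplacian on an n x n x n grid.
n = 4
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
I = np.eye(n)
A = (np.kron(np.kron(T, I), I) + np.kron(np.kron(I, T), I)
     + np.kron(np.kron(I, I), T))
N = n**3
offsets = [0, 1, -1, n, -n, n * n, -n * n]
diagonals = []
for off in offsets:
    d = np.zeros(N)
    rows = np.arange(max(0, -off), min(N, N - off))
    d[rows] = A[rows, rows + off]                    # extract one diagonal
    diagonals.append(d)
x = np.arange(1.0, N + 1)
y = matvec_by_diagonals(offsets, diagonals, x)
```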

Observe the tremendous gains in computational time and in the number of calls to the
matrix by vector multiplication routine, as the order increases. The gain in time is a factor
of nearly 200 between the lowest degree used (6) and the highest degree used (71).

One of the main attractions of a scheme based on this approach is the high degree of
parallelism that it offers. There are opportunities to exploit parallelism in virtually every
part of the algorithm. However, it is often argued that the loss of efficiency, incurred
by mandatory smaller step-sizes, exhibited by explicit schemes as compared with implicit
schemes outweighs the benefits of the high degree of parallelism permitted by the explicit
scheme. For simple problems such as the one tested above, the argument is certainly not
true because of the possibility of using high order schemes as described in this section. It
remains to be seen whether the argument is still valid for the more complex case where A
and/or f depend on time.

The usability of this approach has recently been extended to the case of a non-constant
forcing term f with very encouraging results. Extensions to the more general case where
A is also time dependent, or a nonlinear function of w, are currently under investigation.
Note that this would provide alternative ways of solving time dependent partial differential
equations by the method of lines.

5.2  Application 2: the Lyapunov matrix equation

Another direct application, described in [32], is for solving the matrix Lyapunov equation,

$$AX + XA^T + bb^T = 0, \qquad (24)$$

which is a common problem in the study of dynamical systems with single input,

$$\dot u = Au + bg. \qquad (25)$$

The exact solution of the equation (24), in the case where the corresponding system (25)
is controllable, is given by the expression

$$X = \int_0^\infty e^{\tau A} b b^T e^{\tau A^T} \, d\tau, \qquad (26)$$

known as the controllability Grammian of the dynamical system. In [32] the solution X
as provided by (26) was approximated by replacing the function $e^{\tau A} b$ by its approximation
(16). Interestingly, the result of substituting the approximation (16) in (26) leads to an
approximate solution of the form $X_m = V_m G_m V_m^T$, where $V_m$ is the matrix of the Arnoldi
vectors, and $G_m$ is an m x m matrix which, incidentally, is the solution of a Lyapunov matrix
equation involving m x m matrices. In fact, a rather unexpected result shown in [32] is that
this approximation provided by the above integration process is mathematically equivalent
to a Galerkin method applied to (24) over the subspace of matrices of the form $V_m G V_m^T$,
where $V_m$ is fixed and G runs over the set of m x m matrices. The inner product used for
this Galerkin process is defined by $\langle X, Y \rangle = \mathrm{tr}(Y^T X)$.
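A minimal Python sketch of this Galerkin approximation is given below. It assumes that the small
equation satisfied by $G_m$ is $H_m G + G H_m^T + \beta^2 e_1 e_1^T = 0$ with $\beta = \|b\|_2$,
which is what the Galerkin condition gives when $H_m = V_m^T A V_m$ and $V_m^T b = \beta e_1$.
The small equation is solved here by a Kronecker-product linear system, a choice made for
brevity and not the method of [32]; the function names are hypothetical.

```python
import numpy as np

def arnoldi(A, b, m):
    """Return V (N x m, orthonormal columns), H = V^T A V (m x m), beta = ||b||_2."""
    N = len(b)
    V = np.zeros((N, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(b)
    V[:, 0] = b / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):               # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]        # assumes no breakdown
    return V[:, :m], H[:m, :m], beta

def krylov_lyapunov(A, b, m):
    """Galerkin approximation X_m = V G V^T to the solution of A X + X A^T + b b^T = 0."""
    V, H, beta = arnoldi(A, b, m)
    I = np.eye(m)
    rhs = np.zeros((m, m))
    rhs[0, 0] = -beta**2                     # -(V^T b)(V^T b)^T = -beta^2 e_1 e_1^T
    # Row-major vec: vec(H G) = (H kron I) vec(G), vec(G H^T) = (I kron H) vec(G)
    K = np.kron(H, I) + np.kron(I, H)
    G = np.linalg.solve(K, rhs.ravel()).reshape(m, m)
    return V, G

# Example with a stable symmetric matrix and b = e_1, as in the experiment of the text.
N, m = 30, 10
A = -(2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1))
b = np.zeros(N)
b[0] = 1.0
V, G = krylov_lyapunov(A, b, m)
X = V @ G @ V.T
R = A @ X + X @ A.T + np.outer(b, b)         # residual of (24)
galerkin_defect = np.linalg.norm(V.T @ R @ V)
```

The quantity `galerkin_defect` measures how well the residual is orthogonal to the subspace of
matrices $V_m G V_m^T$ in the trace inner product; it should vanish up to rounding.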

Large Lyapunov equations that cannot be handled otherwise have been solved in this
manner. As an illustration, we now consider a test example derived from the discretization
of a partial differential equation of the form:

$$\frac{\partial u}{\partial t} = \Delta u + F(x,y)\, g(t) \qquad (27)$$

in a rectangular domain, with Dirichlet boundary conditions. Here
$\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}$ is the
Laplacian operator. If we discretize the rectangle using $n_x + 2$ points in the x direction and
$n_y + 2$ points in the y direction, the above equations lead to a matrix problem of the form:

$$\dot u = Au + bg \qquad (28)$$

where A is square of dimension $N = n_x n_y$. In this experiment we took $n_x = 20$ and $n_y = 40$,
leading to a matrix of size 800. We have taken b to be simply $e_1$, the first column of the
identity. Tests with other choices for b showed similar results. First we would like to show
the behavior of the residual achieved by the Krylov subspace method outlined above, as the
degree m varies. Table 2 shows the scaled Frobenius norm of the residual, i.e., the quantity
$\|AX_m + X_m A^T + bb^T\|_F$, where $\|\cdot\|_F$ is the (scaled) Frobenius norm, defined in
terms of $\mathrm{tr}[Z^T Z]$. This is done for m = 5, 10, 15, 20.

We have used the Arnoldi process instead of the Lanczos algorithm on purpose, despite
the fact that the matrix is symmetric, in order to give an idea of the behavior in the more
general situation. The table indicates that the accuracy of the Krylov subspace approximation
to the Lyapunov equation is good for very small m and then improves slowly. The times
reported in this table are in seconds on the Ardent Titan with two processors and have been
obtained using the -O3 compiling option. For details and comparisons with other techniques
the reader is referred to [32].

    m     ||Res||_F     Time (sec)
    5     1.10 E-04     0.18
    10    5.40 E-06     0.23
    15    7.92 E-07     0.35
    20    1.92 E-07     0.45

Table 2: Performance of the Krylov subspace method for the Lyapunov matrix equation.

6  Nonlinear  Krylov  subspace  methods 

This  section  gives  an  overview  of  some  basic  techniques  based  on  Krylov  subspaces  for  solving 
systems  of  nonlinear  equations.  We  start  by  discussing  the  various  ways  in  which  general 
nonlinear  projection  methods  can  be  defined. 

6.1  Nonlinear  projection  methods 

There are several ways of generalizing the standard Galerkin or Petrov-Galerkin methods
to nonlinear equations. For example, Marion and Temam [23] define nonlinear Galerkin
methods by projecting the original equations onto nonlinear manifolds instead of linear
subspaces. Our approach is more conventional in that we still use linear subspaces but may
impose nonlinear Galerkin conditions.

Consider  the  nonlinear  system 

$$F(u) = 0, \qquad (29)$$

where F is a nonlinear function from $\mathbb{R}^N$ to $\mathbb{R}^N$. At each iteration of a
general nonlinear projection method we select a (linear) subspace K and we seek an approximate
solution to (29) of the form $u + \delta$ where $\delta$ belongs to the subspace K and u is the
current iterate. Note that the subspace K changes at every step of the nonlinear iteration. The
standard case examined in [8] is when K is a Krylov subspace associated with the Jacobian of F
at the current iterate. The various nonlinear projection methods we consider differ in the way
the vector $\delta$ is chosen in the subspace.

A natural choice for the next iterate is to select a vector $\delta$ in K such that

$$f(u + \delta) \equiv \tfrac{1}{2} \|F(u + \delta)\|_2^2 \qquad (30)$$

is minimized. Although this is a nonlinear least squares problem, from a practical point of
view it is much easier to solve than the original problem when the dimension m of K is
much smaller than N. The motivation for this approach is that one can exploit a number of
highly efficient packages, such as MINPACK or NL2SOL, for solving least squares problems
of small dimension, such as (30).

Let $V = [v_1, v_2, \ldots, v_m]$ be an N x m matrix whose column vectors represent an
orthonormal basis of the subspace K and write $\delta$ as

$$\delta = Vy, \qquad (31)$$

where y is an m-vector. The function (30) to be minimized becomes a function of y defined
by

$$g(y) = \tfrac{1}{2} \|F(u + Vy)\|_2^2. \qquad (32)$$

The gradient of this function at y is given by

$$\nabla g(y) = V^T J(u + Vy)^T F(u + Vy) \qquad (33)$$

where J(x) is the Jacobian of F at the point $x \in \mathbb{R}^N$. Notice that the gradient of f is
$\nabla f(u) = J(u)^T F(u)$ and so we have the simple relation $\nabla g(y) = V^T \nabla f(u + Vy)$.

A necessary, but not always sufficient, condition for $y^*$ to be a minimum of (32) is that
the gradient of g at $y^*$ vanishes, i.e., we must have

$$V^T J(u + Vy^*)^T F(u + Vy^*) = 0. \qquad (34)$$

This suggests simply solving the equations,

$$(J(u + Vy) V)^T F(u + Vy) = 0 \qquad (35)$$

as a means for finding a minimizer of (32), although we know that the set of solutions of
(35) is larger than the set of minimizers of (32). We refer to the above system of nonlinear
equations as the set of normal equations for minimizing (32).

When solving the above normal equations the Jacobian must be reevaluated at each new
iterate and this may be uneconomical. An alternative is to freeze $J(u + Vy)V$ to be the
system of vectors computed at, say, y = 0 and solve the set of modified equations:

$$(J(u) V)^T F(u + Vy) = 0. \qquad (36)$$

This is a particular case of the Petrov-Galerkin condition

$$W^T F(u + Vy) = 0, \qquad (37)$$

where W is an N x m matrix. Two particular cases are noteworthy:

1.  W = V, which corresponds to the Galerkin case.

2.  W = JV, which was naturally derived above.

When F is linear the first case corresponds to the conjugate gradient method if the coefficient
matrix is symmetric and Arnoldi's method otherwise. The second case corresponds to the
class of methods based on minimizing the residual norm, a few representatives of which are
ORTHOMIN, GCR, GMRES, see [34] for details.

Finally, one may linearize $F(u + Vy)$ in (37) around u and derive fully linearized
techniques which correspond to solving the linear system,

$$W^T [F(u) + J(u) V y] = 0 \qquad (38)$$

where J(u) is the Jacobian of F at the current iterate u. The above linear system is
m-dimensional and will admit a unique solution if the section $W^T J(u) V$, which is an m x m
matrix, is nonsingular. In the particular case where W = JV this means that the columns
of JV must be linearly independent. Note that (38) represents one way of approximately
solving the Newton system

$$F(u) + J(u)\delta = 0 \qquad (39)$$

at every step of Newton's method. Thus, the fully linearized techniques are a particular case
of a class of methods that are commonly referred to as inexact Newton methods and have
been studied in the literature (see, e.g., [14, 6, 26, 8, 7]).
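The fully linearized system (38) is small enough to be solved directly once V is available. The
Python sketch below does this for the two choices W = V and W = JV; the function name, the
test function F and the two-dimensional subspace are all assumptions made for the illustration
(in the methods discussed here, V would span a Krylov subspace).

```python
import numpy as np

def linearized_step(F_u, J_u, V, choice="JV"):
    """Solve W^T [F(u) + J(u) V y] = 0 for y and return the correction delta = V y.

    choice = "V"  uses W = V      (Galerkin condition)
    choice = "JV" uses W = J(u) V (minimizes the linearized residual norm)
    """
    JV = J_u @ V
    W = JV if choice == "JV" else V
    y = np.linalg.solve(W.T @ JV, -(W.T @ F_u))   # m x m system
    return V @ y

# Hypothetical small test problem (N = 3) and a 2-dimensional subspace.
def F(u):
    return np.array([u[0]**2 + u[1]**2 - 4.0, u[0] * u[1] - 1.0, u[2]**3 - u[0]])

def J(u):
    return np.array([[2 * u[0], 2 * u[1], 0.0],
                     [u[1], u[0], 0.0],
                     [-1.0, 0.0, 3 * u[2]**2]])

u = np.array([2.0, 0.5, 1.0])
V, _ = np.linalg.qr(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))  # orthonormal basis

delta_min_res = linearized_step(F(u), J(u), V, choice="JV")
delta_galerkin = linearized_step(F(u), J(u), V, choice="V")

# Reference for the W = JV case: least squares solution of min_y ||F + J V y||_2.
y_ls = np.linalg.lstsq(J(u) @ V, -F(u), rcond=None)[0]
```

With W = JV the system (38) is exactly the set of normal equations of the linear least squares
problem, which is why the result can be compared with `y_ls`.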

6.2  Globally  convergent  nonlinear  Krylov  methods 

In  this  section  we  only  consider  the  fully  linearized  methods  in  the  sense  defined  above.  To 
guarantee  global  convergence,  the  usual  inexact  Newton  methods  must  be  modified  in  several 
ways.  A  few  such  modifications  have  been  proposed  in  [8].  Moreover,  in  [7]  a  number  of 
convergence  results  for  these  techniques  have  been  established.  We  would  like  to  summarize 
some  of  these  results  here. 

The simplest modification involves a backtracking procedure. In this technique, an iterate
$u_n$ is given and we define the next iterate in the form $u_n + \lambda p_n$, where $p_n$ is any
descent direction and $\lambda$ is selected by a procedure which ensures that the function f
decreases sufficiently at each iteration and that the iterate makes sufficient progress towards
the solution. One such procedure based on linesearch backtracking is described below. The search
direction $p_n$ is provided by an approximate solution to the Newton system
$J(u_n) p = -F(u_n)$, e.g., via FOM or GMRES. It is easy to show that $p_n$ is a descent direction
at $u_n$ whenever we have

$$\|F(u_n) + J(u_n) p_n\|_2 < \|F(u_n)\|_2,$$

which means that the residual norm for the Newton system $J(u_n) p = -F(u_n)$ must be
strictly reduced from that associated with p = 0. In particular, it is common, in the context of
iterative methods, to require a condition of the form

$$\|F(u_n) + J(u_n) p_n\|_2 \le \eta_n \|F(u_n)\|_2,$$

where $\eta_n \le \eta < 1$.

In the procedure described below the two parameters $\theta_{\min}$, $\theta_{\max}$ are such that
$0 < \theta_{\min} \le \theta_{\max} < 1$, the simplest choice being
$\theta_{\min} = \theta_{\max} = 1/2$. The procedure requires another
parameter $\epsilon_* > 0$ which is used to essentially rescale the starting step in the process
in order to prevent it from being too small.

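Algorithm 3.1:  General Backtracking Procedure

1.  Set $\lambda = \max\{1, \epsilon_* |\nabla f(u_n)^T p_n| / \|p_n\|_2^2\}$.

2.  If $f(u_n + \lambda p_n) \le f(u_n) + \alpha \lambda \nabla f(u_n)^T p_n$, then set
    $\lambda_n = \lambda$, and exit. Else:

3.  Choose $\hat\lambda \in [\theta_{\min} \lambda, \theta_{\max} \lambda]$; set
    $\lambda \leftarrow \hat\lambda$. Go to (2).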
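The backtracking loop can be sketched as follows in Python. This is a simplified illustration,
not the procedure of [7] or [8]: it assumes $\theta_{\min} = \theta_{\max} = \theta$, starts
from $\lambda = 1$ instead of the rescaled starting step, takes an exact Newton direction, and
uses an arbitrary value for the sufficient-decrease parameter $\alpha$. All names are hypothetical.

```python
import numpy as np

def backtrack(f, grad_f_u, u, p, alpha=1e-4, theta=0.5, lam0=1.0, max_steps=50):
    """Return a step length lam with f(u + lam p) <= f(u) + alpha * lam * grad_f(u)^T p."""
    slope = grad_f_u @ p          # negative when p is a descent direction
    f_u = f(u)
    lam = lam0
    for _ in range(max_steps):
        if f(u + lam * p) <= f_u + alpha * lam * slope:
            return lam            # sufficient decrease achieved
        lam *= theta              # the only choice in [theta*lam, theta*lam]
    raise RuntimeError("no acceptable step length found")

# Scalar example where the full Newton step overshoots: F(u) = arctan(u), u_0 = 3.
def F(u):
    return np.arctan(u)

def J(u):
    return np.diag(1.0 / (1.0 + u**2))

def f(u):
    return 0.5 * np.sum(F(u)**2)

u0 = np.array([3.0])
p = np.linalg.solve(J(u0), -F(u0))        # exact Newton direction
grad = J(u0).T @ F(u0)                    # gradient of f = 0.5 ||F||^2
lam = backtrack(f, grad, u0, p)
u1 = u0 + lam * p
```

In this example the full step and the half step both increase f, so the procedure settles on a
quarter step.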

The following theorem [7] is a general convergence result for sequences generated by the
above algorithm.

Theorem 6.1  Let $f = \tfrac{1}{2}\|F\|_2^2$ be differentiable and assume that its gradient is such that

$$\|\nabla f(x) - \nabla f(y)\|_2 \le \gamma \|x - y\|_2, \quad \text{for all } x, y \in \mathbb{R}^N. \qquad (40)$$

Let $p_n$ be such that $\|F_n + J_n p_n\|_2 \le \eta \|F_n\|_2$ for all n, with $\eta < 1$. Further,
let each iterate be chosen by the General Backtracking Algorithm. Then, either

$$\lim_{n \to \infty} f(u_n) = 0 \qquad (41)$$

or

$$\lim_{n \to \infty} \|p_n\|_2 = \infty. \qquad (42)$$

Moreover, superlinear convergence will essentially take place under the additional condition
that $\eta_n \to 0$. Global convergence of the method using the trust region model approach has
also been analyzed in [7].

One of the most successful ways of using nonlinear Krylov subspace methods is for solving
nonlinear equations in which the Jacobian of F is not available or is too expensive to
compute. Note that the cost of producing the Jacobian may well take into account the initial
programming effort. The reason why we can still use the methods outlined above is that
Krylov subspace methods do not require the Jacobian matrix J explicitly, but only its action
on a vector v. This action can be well approximated by a difference quotient of the form

$$J(u) v \approx \frac{F(u + \sigma v) - F(u)}{\sigma},$$

where u is an approximation to a solution of (29), and $\sigma$ is some small scalar. The above
observation has been exploited in several papers [37, 22, 20, 9] to accelerate fixed-point
iterations of the form

$$u_{n+1} = M(u_n)$$

by applying the above techniques to the system $F(u) = u - M(u) = 0$. Typically, the
Jacobian of the mapping F is a dense matrix and it may be impractical to compute it for
large problems.
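The difference quotient above is straightforward to code. The sketch below, in Python, compares
it with an analytic Jacobian on a small made-up function; the function name, the test function
and the value of $\sigma$ are assumptions for illustration (in practice $\sigma$ is usually
scaled with the sizes of u and v).

```python
import numpy as np

def jacobian_vector_fd(F, u, v, sigma=1e-7):
    """Approximate J(u) v by the forward difference (F(u + sigma v) - F(u)) / sigma."""
    return (F(u + sigma * v) - F(u)) / sigma

# Hypothetical test function with a known Jacobian.
def F(u):
    return np.array([u[0]**2 + u[1] - 3.0, np.sin(u[0]) + u[1]**3])

def J(u):
    return np.array([[2 * u[0], 1.0],
                     [np.cos(u[0]), 3 * u[1]**2]])

u = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
approx = jacobian_vector_fd(F, u, v)   # needs one extra evaluation of F (two in all here)
exact = J(u) @ v
```

Inside a Krylov method, F(u) is already known, so each product with the Jacobian costs a single
additional evaluation of F and no matrix is ever formed.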




6.3  Application: equations of semi-conductor device simulation

There are several ways of writing the equations of semi-conductor device simulation. One
form of the equations uses the "quasi-Fermi levels" v and w that are related to the electron
density n and hole density p by $n = e^{u-v}$ and $p = e^{w-u}$. The dimensionless form of the
steady state equations is as follows,

$$\nabla \cdot (e^{u-v} \nabla v) = 0, \qquad (43)$$

$$\nabla \cdot (e^{-u+w} \nabla w) = 0, \qquad (44)$$

$$-\nabla^2 u + e^{u-v} - e^{w-u} - k_1 = 0, \qquad (45)$$

subject to appropriate (possibly mixed) boundary conditions. The first two equations
represent the continuity equations for electrons and holes while the third is Poisson's equation
for the (normalized) potential u. The term $k_1$ is the doping profile (in units of the intrinsic
density of the semiconductor), which is essentially a source term. For more details about
the above system of equations and the assumptions made see, e.g., [4, 20].

The above system represents a coupled nonlinear system of three partial differential
equations. The so-called decoupling algorithm used in this context to solve this system
consists of a Block Gauss-Seidel iteration on the discrete version of the above equations. It
can be briefly described as follows. Given the dimensionless potential u from the previous
iteration, one obtains the intermediate variables v and then w by solving (43) and then (44).
Then the potential equation (45) is solved to obtain a new potential $\bar u$. The whole mapping
from the old potential u to the new potential $\bar u$ will be denoted T:

$$\bar u = T(u). \qquad (46)$$

Whereas the above algorithm is very robust in practice, it is found that there are
situations where it becomes very slow. The method has been in general abandoned in favor of
the Full Newton schemes to solve the coupled system (43)-(44)-(45), see reference [4]. The
Newton equations are typically solved by direct methods. As simulators are now starting to
cover three-dimensional models, there is renewed interest in iterative methods for solving
the Newton equations.

Another approach proposed in [20] is to apply an acceleration procedure to the fixed
point iteration $u_{k+1} = T(u_k)$. In other words we would like to solve the system

$$u - T(u) = 0.$$

Note that the Jacobian of T is generally a dense matrix here and would be rather difficult
to compute. However, as was stressed before, there is no difficulty using a procedure such as
GMRES to compute the zero of $u - T(u)$ even when the Jacobian is not explicitly available.
This was implemented and tested in [20]. A comparison between the non-accelerated version of
the decoupling algorithm and the accelerated version showed considerable gains in speed, a
factor of about 8 in the example reported. Moreover, GMRES acceleration was also vastly
superior to two variants of Chebyshev acceleration schemes. This substantial superiority can
be attributed to the capacity of GMRES to take advantage of the clustering of the spectrum
of the Jacobian of T around the origin. This property of the spectrum $\sigma[T_u]$ can be
related to the compact differentiability of the continuous mapping T. Like other conjugate
gradient type methods, GMRES is able to take advantage of a spectrum in which the rate of
convergence is slowed down by a few isolated eigenvalues only. Chebyshev type schemes do not
take advantage of any clustering of the spectrum, but only of the overall size and shape of its
convex hull.

7  Conclusion 

We  have  shown  several  ways  in  which  Krylov  subspaces  can  be  used  to  solve  various  types 
of  scientific  problems.  The  scope  of  application  areas  where  these  methods  can  be  used  has 
been  steadily  widening  in  recent  years.  One  might  say  that  the  method  constitutes  in  effect 
a  universal  way  of  reducing  the  dimensionality  of  the  original  problem.  Perhaps  one  of  the 
most  challenging  areas  where  the  method  can  be  used  is  in  solving  inverse  problems.  Some 
of  the  ideas  presented  in  Section  6  may  possibly  be  exploited  for  this  purpose  but  there  is 
still much to be done in this direction.

1  Introduction


The  field  of  numerical  optimization,  one  of  the  most  challenging  areas  of  numerical  analysis,  has 
recorded  an  impressive  growth  in  the  past  few  years.  Apart  from  widely  known  developments 
in  linear  programming  (see  [23],  for  instance),  this  growth  has  also  been  fueled  by  a  steadily 
increasing  interest  in  nonlinear  problems  involving  a  large  number  of  variables.  However,  the 
importance  of  the  new  results  and  the  power  of  the  new  algorithms  developed  are  probably  not 
fully  appreciated  by  the  larger  community  of  numerical  analysts  and,  even  more  importantly, 
by  the  community  of  potential  users  of  large  scale  nonlinear  optimization  methods.  Recent 
publications  devoted  specifically  to  large  scale  problems  include  [18],  [9],  [10],  and  [2]. 

It  is  a  widely  held  view  that  only  rather  small  problems  can  be  handled  by  the  techniques 
available.  Specifying  nonlinear  optimization  problems  in  more  than  20  variables,  say,  is  still 
considered  by  many  as  a  risky  modelling  approach,  mostly  because  the  problem’s  solution  is 
likely  to  be  impossible  with  the  existing  algorithms.  It  is  true  that  this  view  was  justified  ten 
years ago. The present situation is however quite different and it is the purpose of this paper to
stress this change and to present some of the concepts that resulted in this progress.

These concepts will be presented here from the authors’ very personal (and maybe biased)
point of view. In particular, no attempt is made to discuss every concept and significant recent
development in the area; rather, the exposition will focus on two topics that are considered to be
fundamental  by  the  authors.  We  will  also  outline  our  present  and  forthcoming  research  in  this 
area,  both  from  the  algorithmic  and  software  perspective. 

The  first  part  of  this  paper  is  devoted  to  an  introduction  to  the  classes  of  partially  separable 
and  group  partially  separable  functions.  Understanding  and  exploiting  these  concepts  is,  in  our 
opinion,  central  to  the  development  of  efficient  and  reliable  methods  for  large  scale  problems,  much 
in  the  same  way  that  sparsity  is  a  key  to  large  scale  numerical  linear  algebra.  This  introduction 
is  contained  in  Section  2. 

The  second  part  of  the  paper  discusses  the  LANCELOT  software  project,  whose  purpose  is  to 
produce  a  self  contained  system  for  large  scale  nonlinear  optimization.  The  discussion  emphasizes 
the  main  objectives  of  the  system,  with  a  brief  discussion  of  some  data  input  and  implementation 
issues. 

2  The  structure  of  nonlinear  problems 

2.1  A  tutorial  on  partial  separability 

Very  few  numerical  analysts  would  contest  today  the  crucial  role  played  by  the  various  techniques 
for  exploiting  the  structure  of  a  problem  in  the  development  of  practical  computational  methods 
for  large  scale  problems.  Sparsity  in  large  systems  of  linear  equations,  domain  decomposition  in 
the numerical solution of partial differential equations and structure in the interpolation equations
for  function  approximation  are  probably  the  three  examples  that  come  first  to  mind.  The  situation 
is  entirely  similar  in  large  scale  nonlinear  optimization,  where  exploiting  the  structure  of  large 
problems  is  the  only  reasonable  way  to  tackle  their  solution. 


Introduced in 1981 by Andreas Griewank and the third author in [15], the notion of a partially
separable  function  can  be  viewed  as  a  way  to  describe  the  structure  of  a  nonlinear  function  in 
terms  of  an  underlying  geometry  (i.e.  subspaces  and  their  relations). 

It  is  the  authors’  belief  that  this  notion  can  be  very  helpful,  not  only  to  algorithm  designers, 
but  also  to  potential  users  of  these  algorithms.  Indeed,  for  a  method  to  exploit  structure,  it  is 
very  beneficial  that  the  user  describes  this  structure  in  a  way  coherent  with  the  implementation. 
This in turn supposes that the user should be at least moderately familiar with the principles of
the  description  used. 

The  concept  of  partial  separability  will  be  introduced  and  analysed  by  a  simple  yet  meaningful 
example:  the  discretised  minimum  surface  problem  over  the  unit  square.  We  will  not  be  interested 
in  this  problem  as  such,  but  rather  in  showing  that  one  of  its  discretised  formulations  exhibits 
the  type  of  structure  that  we  want  to  exploit. 

The minimum surface variational problem is well known, and consists in finding the surface of
minimum area that interpolates a given continuous function on the boundary of the unit square.
The unit square itself is discretised uniformly into $m^2$ smaller squares, as shown in Figure 1.

Figure 1: Discretisation of the unit square


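The variables of the problem will be chosen as the “height” of the unknown surface above the
$(m+1)^2$ vertices of the $m^2$ smaller squares. The area of the surface above the $(i,j)$-th discretised
square is then approximated by the formula

$$ s_{ij}(x_{ij}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) = \frac{1}{m^2} \sqrt{1 + \frac{m^2}{2} \left[ (x_{ij} - x_{i+1,j+1})^2 + (x_{i,j+1} - x_{i+1,j})^2 \right]} \tag{1} $$
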
where  the  squares  and  variables  are  indexed  as  shown  in  Figure  1.  The  precise  justification  of 
this  formula  does  not  matter  here,  but  we  concentrate  instead  on  its  form. 

The variables in the sets

$$ \{x_{1,j}\}_{j=1}^{m+1}, \quad \{x_{m+1,j}\}_{j=1}^{m+1}, \quad \{x_{i,1}\}_{i=1}^{m+1} \quad \text{and} \quad \{x_{i,m+1}\}_{i=1}^{m+1} $$

correspond to the boundary conditions and are assigned given values, while the others are left
free. The complete problem is then to minimise the objective function

$$ f(x) = \sum_{i,j=1}^{m} s_{ij}(x_{ij}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}) $$

over all the free variables¹.
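
The objective is simply a sum over the squares of the grid. The Python sketch below evaluates it, assuming that the element area takes the form $\frac{1}{m^2}\sqrt{1 + \frac{m^2}{2}(u^2 + v^2)}$, where $u$ and $v$ are the height differences across the two diagonals of a square. The function names and the 0-based indexing are ours and purely illustrative.

```python
import math

def element_area(m, a, b, c, d):
    """Approximate area above one square with corner heights
    a = x[i][j], b = x[i+1][j], c = x[i][j+1], d = x[i+1][j+1]."""
    u = a - d          # difference along one diagonal
    v = c - b          # difference along the other diagonal
    return math.sqrt(1.0 + 0.5 * m * m * (u * u + v * v)) / (m * m)

def surface_area(x):
    """Sum of the m*m element areas; x is an (m+1) x (m+1) list of heights."""
    m = len(x) - 1
    return sum(element_area(m, x[i][j], x[i + 1][j], x[i][j + 1], x[i + 1][j + 1])
               for i in range(m) for j in range(m))

m = 4
flat = [[0.0] * (m + 1) for _ in range(m + 1)]
tilted = [[i / m for _ in range(m + 1)] for i in range(m + 1)]
print(surface_area(flat), surface_area(tilted))
```

For a flat surface the result is the area of the unit square, and for the tilted plane with heights $i/m$ it is $\sqrt{2}$ up to rounding; both make convenient sanity checks.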

A first analysis quickly shows that, using a row-wise ordering of the variables, the Hessian
$\nabla^2 f(x)$ has a block-tridiagonal sparsity pattern with tridiagonal blocks. We note that storing this
matrix, which is typically required in one form or another in most efficient algorithms, therefore
requires storing $O[5(m+1)^2]$ real numbers. More importantly, this sparsity structure may be
discovered by considering the Hessian of $s_{ij}$ as a function of $x_{ij}$, $x_{i+1,j}$, $x_{i,j+1}$ and $x_{i+1,j+1}$, and
then by assigning each of the rows and columns of this dense $4 \times 4$ matrix to the relevant row
and column of the larger $\nabla^2 f(x)$. If we consider each $s_{ij}$ as a function of the complete set of
$(m+1)^2$ variables, denoted by $s_{ij}(x)$, we see that it satisfies the important property that

$$ s_{ij}(x) = s_{ij}(x + w) \tag{4} $$

for all vectors $w$ in the invariant subspace

$$ N'_{ij} = \{ w \in \mathbb{R}^{(m+1)^2} \mid w_{ij} = w_{i,j+1} = w_{i+1,j} = w_{i+1,j+1} = 0 \}. \tag{5} $$

Each $s_{ij}$ is therefore invariant with respect to all translations corresponding to vectors of $N'_{ij}$.
From this invariance, it is again easy to deduce that $\nabla^2 s_{ij}(x)$ is a (very) sparse matrix, but
we  stress  the  point  that  (4)-(5)  is  in  fact  the  most  important  observation  when  analysing  the 
structure  of  the  second  derivatives  of  our  model  problem.  We  may  then  interpret  the  sparsity 
pattern  of  this  last  matrix  as  a  consequence  of  the  problem  structure:  sparsity  is  easily  derived 
from  the  structure,  but  the  reverse  is  not  true. 

If we now examine the function $s_{ij}$ in more detail, we easily see that, instead of being a
function of the four variables $x_{ij}$, $x_{i+1,j}$, $x_{i,j+1}$ and $x_{i+1,j+1}$, it is, in fact, a function of the two
internal variables

$$ u_{ij} = x_{ij} - x_{i+1,j+1} \quad \text{and} \quad v_{ij} = x_{i,j+1} - x_{i+1,j}. \tag{6} $$


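If we write the simple linear transformation

$$ \begin{pmatrix} u_{ij} \\ v_{ij} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{pmatrix} \begin{pmatrix} x_{i,j} \\ x_{i,j+1} \\ x_{i+1,j} \\ x_{i+1,j+1} \end{pmatrix}, \tag{7} $$

we can then reformulate $s_{ij}$ in terms of these internal variables as

$$ s_{ij}(x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}) = \hat{s}_{ij}(u_{ij}, v_{ij}), \tag{8} $$

where the new function $\hat{s}_{ij}$ is easily derived from (1) and is given by

$$ \hat{s}_{ij}(u_{ij}, v_{ij}) = \frac{1}{m^2} \sqrt{1 + \frac{m^2}{2} \left( u_{ij}^2 + v_{ij}^2 \right)}. \tag{9} $$

¹ Various obstacle problems can also be obtained by specifying suitable bounds on the variables.

Computing now the gradient and Hessian of $\hat{s}_{ij}$ with respect to its two arguments, we obtain

$$ \nabla s_{ij}(x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}) = W^T \nabla \hat{s}_{ij}(u_{ij}, v_{ij}), \tag{10} $$

where the matrix

$$ W = \begin{pmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{pmatrix}, \tag{11} $$

and also that

$$ W^T \nabla^2 \hat{s}_{ij}(u_{ij}, v_{ij}) \, W = \nabla^2 s_{ij}(x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1}), \tag{12} $$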

where $s_{ij}$ is again considered as a function of four variables. But the Hessian of $\hat{s}_{ij}$ is now a
$2 \times 2$ symmetric matrix, while that of $s_{ij}$ is $4 \times 4$! Furthermore, we did not index the matrix
$W$, because the same linear transformation holds for all values of $i$ and $j$. Storing the second
derivatives of our problem therefore requires storing $W$ once, plus storing the $m^2$ Hessians of
the functions $\hat{s}_{ij}$, which amounts to $O[3(m+1)^2]$ real numbers. This is a substantial reduction
compared to the $O[5(m+1)^2]$ required by the more classical “sparsity oriented” approach.
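
To make the storage comparison concrete, the following Python sketch builds the $2 \times 2$ element Hessians and scatters $W^T H W$ into the upper triangle of the assembled Hessian. It assumes that the element function in internal variables is $\frac{1}{m^2}\sqrt{1 + \frac{m^2}{2}(u^2 + v^2)}$; the helper names and the dictionary used as a sparse matrix are illustrative choices of ours.

```python
import math

# Shared transformation; it acts on (x[i][j], x[i][j+1], x[i+1][j], x[i+1][j+1]).
W = [[1, 0, 0, -1],
     [0, 1, -1, 0]]

def internal_hessian(m, u, v):
    """2x2 Hessian of sqrt(1 + (m^2/2)(u^2 + v^2)) / m^2 with respect to (u, v)."""
    r = math.sqrt(1.0 + 0.5 * m * m * (u * u + v * v))
    c = m * m / (4.0 * r ** 3)
    return [[0.5 / r - c * u * u, -c * u * v],
            [-c * u * v, 0.5 / r - c * v * v]]

def assemble(x):
    """Return the list of element Hessians and the assembled upper triangle."""
    m = len(x) - 1
    elems, upper = [], {}
    for i in range(m):
        for j in range(m):
            # Row-wise numbering of the four corner variables.
            idx = [i * (m + 1) + j, i * (m + 1) + j + 1,
                   (i + 1) * (m + 1) + j, (i + 1) * (m + 1) + j + 1]
            u = x[i][j] - x[i + 1][j + 1]
            v = x[i][j + 1] - x[i + 1][j]
            H = internal_hessian(m, u, v)
            elems.append(H)
            for a in range(4):
                for b in range(4):
                    p, q = idx[a], idx[b]
                    if p <= q:  # keep the upper triangle only
                        val = sum(W[k][a] * H[k][l] * W[l][b]
                                  for k in range(2) for l in range(2))
                        upper[(p, q)] = upper.get((p, q), 0.0) + val
    return elems, upper

m = 10
x = [[0.0] * (m + 1) for _ in range(m + 1)]
elems, upper = assemble(x)
# Three numbers per symmetric 2x2 element Hessian versus assembled entries.
print(3 * len(elems), len(upper))
```

For $m = 10$ the element-wise representation holds 300 numbers, against 541 structurally nonzero entries in the assembled upper triangle.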

In order to abstract from these observations, we simply note that (4) holds not only for the
vectors $w$ in $N'_{ij}$, but for all $w$ in

$$ N_{ij} = N'_{ij} + \{ w \in \mathbb{R}^{(m+1)^2} \mid w_{ij} = w_{i+1,j+1} \text{ and } w_{i,j+1} = w_{i+1,j} \}. \tag{13} $$
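
The invariance property is easy to check numerically. The Python sketch below assumes that an element depends on the heights only through the two diagonal differences, in the form $\frac{1}{m^2}\sqrt{1 + \frac{m^2}{2}(u^2 + v^2)}$. It builds one translation that vanishes at the four corners and one from the larger subspace in (13), and confirms that neither changes the element value beyond rounding. All names are illustrative.

```python
import math
import random

def s_ij(x, m, i, j):
    """Element function viewed as a function of the full (m+1) x (m+1) array x."""
    u = x[i][j] - x[i + 1][j + 1]
    v = x[i][j + 1] - x[i + 1][j]
    return math.sqrt(1.0 + 0.5 * m * m * (u * u + v * v)) / (m * m)

def shifted(x, w):
    """Return x + w, entry by entry."""
    return [[xv + wv for xv, wv in zip(xr, wr)] for xr, wr in zip(x, w)]

m, i, j = 4, 1, 2
rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(m + 1)] for _ in range(m + 1)]

# w1: arbitrary everywhere except at the four corners, where it is zero.
w1 = [[rng.uniform(-1, 1) for _ in range(m + 1)] for _ in range(m + 1)]
for a, b in [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]:
    w1[a][b] = 0.0

# w2: in addition, equal shifts on each diagonal pair of corners.
w2 = [row[:] for row in w1]
w2[i][j] = w2[i + 1][j + 1] = 0.7
w2[i][j + 1] = w2[i + 1][j] = -1.3

base = s_ij(x, m, i, j)
print(abs(s_ij(shifted(x, w1), m, i, j) - base))
print(abs(s_ij(shifted(x, w2), m, i, j) - base))
```

The second translation moves all four corner heights, yet leaves both diagonal differences, and hence the element value, unchanged.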

We  can  then  use  all  the  above  remarks  to  define  partially  separable  functions  as  follows. 

We say that $f(x)$ is partially separable if and only if

1. it can be written as a sum of element functions, that is

$$ f(x) = \sum_{i=1}^{p} f_i(x), \tag{14} $$

2. each of these element functions has a nontrivial invariant subspace, that is if, for each
$i \in \{1, \ldots, p\}$, there exists a subspace $N_i \neq \{0\}$ such that, for every $x \in \mathbb{R}^n$ and $w \in N_i$, we
have that

$$ f_i(x) = f_i(x + w). \tag{15} $$

This  definition  is  not  the  most  general  one  (see,  for  example  [15]),  but  has  the  advantage  of 
being  rather  intuitive.  As  in  the  minimum  surface  example  above,  we  will  be  mostly  interested 
in  the  case  where  the  dimension  of  the  invariant  subspaces  Ni  is  large  compared  to  the  total 
dimension of the problem. We may then set up a linear transformation for each element, that
transforms the problem variables $\{x_j\}_{j=1}^{n}$ into internal variables for the element, $u_i$ say, as
expressed by the relation

$$ u_i = W_i x. \tag{16} $$

In our example, this transformation matrix is obtained by composing the selection of the four
variables at the corners of the considered discretised square with (11).

Once the transformation from problem variables to internal ones is defined, it is only necessary
to store, compute and/or update the derivatives

$$ \nabla f_i(u_i) \quad \text{and} \quad \nabla^2 f_i(u_i), \tag{17} $$

whose dimensions are small.

Moreover,  the  situation  arising  in  our  model  example  is  very  common.  The  transformation 
can  be  separated  into  the  product  of  two  distinct  operators:  the  first  selects  a  (usually  small) 
number  of  problem  variables  that  explicitly  appear  in  a  given  element,  and  the  second  defines  a 
further  linear  transformation  of  these  selected  variables  yielding  the  internal  ones.  We  note  that 
the first of these operators need not be expressed in terms of a matrix, but merely in terms
of  a  list  of  problem  variables  associated  with  the  considered  element.  This  saves  a  substantial 
amount  of  matrix  manipulation.  As  in  our  example,  it  also  frequently  happens  that  the  second 
of  these  transformations,  that  is  from  the  selected  problem  variables  to  the  internal  ones,  is 
independent of the element under consideration. Even when it is not, there are frequently only a
few different such transformations², which again saves storage and computation.
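
One possible way to organise this information in code is sketched below in Python. The `Element` class, its field names and the toy chain function are hypothetical; they only illustrate the split between a list of selected variables and a small transformation shared by all elements.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Matrix = List[List[float]]

@dataclass
class Element:
    variables: List[int]                       # problem variables that appear
    transform: Matrix                          # selected variables -> internal ones
    func: Callable[[Sequence[float]], float]   # function of the internal variables

    def internal(self, x):
        sel = [x[k] for k in self.variables]   # selection: a list, not a matrix
        return [sum(w * s for w, s in zip(row, sel)) for row in self.transform]

    def value(self, x):
        return self.func(self.internal(x))

def evaluate(elements, x):
    """Value of the partially separable function: the sum of its elements."""
    return sum(e.value(x) for e in elements)

# One transformation object shared by every element: u = x[k] - x[k+1].
DIFF = [[1.0, -1.0]]
n = 6
elements = [Element([k, k + 1], DIFF, lambda u: u[0] ** 2) for k in range(n - 1)]
x = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
print(evaluate(elements, x))
```

Each element here has two elemental variables but a single internal one, and the matrix `DIFF` is stored once however many elements there are.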

Since  the  invariant  subspaces  are  not  necessarily  spanned  by  vectors  of  the  canonical  basis 
(as  in  our  example),  we  see  that  the  notion  of  partial  separability  extends  that  of  sparsity. 

More  formally,  we  may  state  the  following  result. 

Theorem 1 Every twice continuously differentiable function from $\mathbb{R}^n$ into $\mathbb{R}$ having a sparse
Hessian matrix is partially separable.

The  reader  is  referred  to  [15,  section  2]  for  a  more  detailed  discussion  of  this  basic  property. 
Finally,  it  may  be  worthwhile  to  note  that  separable  functions,  that  is  functions  of  the  form 

$$ f(x) = \sum_{i=1}^{n} f_i(x_i), \tag{18} $$

are clearly a very restricted case of partially separable functions. Because the variables $x_i$ must
all be different in (18), we prefer to call these functions totally separable.

2.2  Group  partial  separability 

A  significant  proportion  of  practical  large  optimization  problems  exhibit  another  very  important 
structure:  the  assignment  of  sets  of  element  functions  to  groups.  The  most  typical  and  pervasive 
example  is  probably  that  of  the  least-squares  problem,  where  element  functions  are  gathered  into 
groups  which  are  then  squared. 

Grouping  nonlinear  elements  into  sets  is  also  desirable  if  we  consider  solving  constrained 
problems:  it  is  indeed  necessary  to  distinguish  the  element  functions  associated  with  the  objective 
from  those  associated  with  the  constraints. 

In  order  to  achieve  this  grouping,  we  have  to  extend  the  notion  of  partial  separability  and 
define  a  slightly  more  general  class.  We  will  say  that  the  real  function  f(x)  is  a  group  partially 
separable  function  if  and  only  if  it  is  of  the  form 

$$ f(x) = \sum_{j=1}^{q} g_j(h_j(x)), \tag{19} $$

where the group functions $g_j$ are twice continuously differentiable functions from $\mathbb{R}$ into itself,
and where their arguments $h_j(x)$ are partially separable functions from $\mathbb{R}^n$ into $\mathbb{R}$. Expanding
the $h_j$, we obtain the expression

$$ f(x) = \sum_{j=1}^{q} g_j \Big( l_j(x) + \sum_{i \in J_j} f_i(x) \Big), \tag{20} $$

where $l_j(x)$ is the linear part of $h_j(x)$, if any: the element functions $f_i(x)$ ($i \in J_j$) now contain
the purely nonlinear part of the $j$-th group.

² In a finite element application, for instance, there are as many transformations as distinct element types in the
problem.

Our  model  problem  of  the  previous  subsection  also  exhibits  this  structure.  Indeed,  we  can 
choose  the  group  functions  as 

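$$ g_{ij}(y) = \frac{1}{m^2} \sqrt{y} \tag{21} $$

(where $y$ is called the group variable), the linear part of the groups as

$$ l_{ij}(x) = 1 \tag{22} $$

and the two nonlinear element functions of the $(i,j)$-th group as

$$ f_{ij,1} = \frac{m^2}{2} (x_{ij} - x_{i+1,j+1})^2 \quad \text{and} \quad f_{ij,2} = \frac{m^2}{2} (x_{i,j+1} - x_{i+1,j})^2, \tag{23} $$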

where  we  used  again  our  double  indexing  convention  for  the  group  indices.  In  the  same  spirit  as 
above,  each  of  these  element  functions  clearly  has  two  elemental  variables  and  only  one  internal. 

An  algorithm  capable  of  minimising  group  partially  separable  functions  is  a  very  powerful  tool, 
as  it  can  be  applied,  without  any  modification,  to  nonlinear  least-squares  problems,  constrained 
problems  (in  particular,  to  their  (augmented)  Lagrangian  formulations)  and  many  other  cases. 
Such  an  algorithm,  called  SBMIN,  has  been  developed  by  the  authors  in  the  context  of  the 
LANCELOT  project  that  is  presented  below. 
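
The evaluation of a group partially separable function and of its gradient follows directly from the chain rule. The Python sketch below is a hypothetical illustration of that bookkeeping on a small least-squares example; the `Group` class and its fields are our own naming and say nothing about how SBMIN is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

# (indices of the elemental variables, value function, gradient function)
ElementFn = Tuple[List[int],
                  Callable[[Sequence[float]], float],
                  Callable[[Sequence[float]], List[float]]]

@dataclass
class Group:
    g: Callable[[float], float]                  # group function
    dg: Callable[[float], float]                 # its derivative
    linear: Dict[int, float] = field(default_factory=dict)  # coefficients of l_j
    constant: float = 0.0                        # constant term of l_j
    elements: List[ElementFn] = field(default_factory=list)

def value_and_gradient(groups, x):
    f, grad = 0.0, [0.0] * len(x)
    for grp in groups:
        h = grp.constant + sum(c * x[k] for k, c in grp.linear.items())
        dh = dict(grp.linear)                    # gradient of h_j, stored sparsely
        for idx, fn, dfn in grp.elements:
            xs = [x[k] for k in idx]
            h += fn(xs)
            for k, d in zip(idx, dfn(xs)):
                dh[k] = dh.get(k, 0.0) + d
        f += grp.g(h)
        slope = grp.dg(h)                        # chain rule: g'(h) * grad h
        for k, d in dh.items():
            grad[k] += slope * d
    return f, grad

square = lambda y: y * y
dsquare = lambda y: 2.0 * y
# f(x) = (x0 + x1^2 - 2)^2 + (x1 * x2)^2, written as two squared groups.
groups = [
    Group(square, dsquare, linear={0: 1.0}, constant=-2.0,
          elements=[([1], lambda v: v[0] ** 2, lambda v: [2.0 * v[0]])]),
    Group(square, dsquare,
          elements=[([1, 2], lambda v: v[0] * v[1], lambda v: [v[1], v[0]])]),
]
print(value_and_gradient(groups, [1.0, 2.0, 3.0]))
```

Replacing `square` by another group function changes the problem class without touching the evaluation loop, which is the point of the grouping.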

Clearly,  we  can  interpret  the  use  of  group  partial  separability  as  an  additional  step  (compared 
to  partial  separability  alone)  in  the  exploitation  of  the  computational  tree  associated  with  a  given 
real  function  of  several  variables,  and  then  wonder  if  one  more  step  could  not  bring  further 
advantages. The resulting procedure would then become closer and closer to the exploitation
of the complete tree, as advocated by McCormick and co-workers in the concept of factorable
functions (see [19], for instance). A complete discussion of the relative merits of partial vs
complete  exploitation  of  the  computational  tree  associated  with  a  given  problem  is  outside  the 
scope  of  the  present  paper,  but  should,  at  least,  take  the  following  arguments  into  account. 

•  There  is  a  difference  between  using  the  complete  computational  graph  [14]  for  efficient 
calculation  of  various  quantities  (for  example,  derivatives)  that  are  requested  by  a  given 
algorithm,  and  using  the  complete  tree  in  the  algorithm  itself. 

The first approach is used by the promising new automatic differentiation algorithms, as
discussed in [14], while the second raises the question of the possible use of derivatives of
an  order  higher  than  two.  This  very  interesting  approach  has  been  taken  by  R.  Schnabel 
and  co-authors  (see  [22]  for  an  introduction  to  the  subject),  but  the  applications  have  been 
restricted to small problems and specific subsets of the higher derivatives. Whether or not
this type of technique can be extended to large problems and complete higher derivatives
(and  whether  this  is  desirable)  remains  an  open  question:  if  a  Hessian  matrix  is  large,  a 
third  order  derivative  tensor  is  huge...  but  again  structure  might  play  an  important  role 
here. 

•  Storage  being  a  factor  of  importance  for  large  problems,  one  has  to  reach  a  compromise 
between  storage  needs  and  efficiency  for  a  given  algorithm. 

The  storage  required  for  an  algorithm  using  group  partial  separability  is  of  the  same  order  as 
that  required  for  a  specialised  nonlinear  least  squares  solver,  and  using  specialised  software 
for  this  last  application  is  widely  recommended.  This  indicates  that  the  balance  between 
storage  and  efficiency  is  reasonably  achieved  in  our  context. 

•  A  number  of  problems  involve  “black  boxes”,  that  is  user  supplied  routines  for  computing 
some problem dependent numerical functions, for which the structure is unknown. Although
it may be possible in the future to process these routines using a specially designed
“precompiler”  that  would  extract  information  about  their  underlying  structure,  it  is  still 
necessary  for  today’s  algorithms  to  use  these  black  boxes  as  they  stand.  Furthermore,  if 
these  black  boxes  link  some  specific  subsets  of  variables  together,  this  property  should  be 
expressed  in  the  structural  description  of  the  complete  problem.  This  can  be  handled  very 
naturally  in  the  (group)  partially  separable  framework  by  identifying  such  black  boxes  with 
element  functions. 

2.3  Structure  and  new  computer  architectures 

An  important  issue  in  the  design  of  algorithms  for  large  scale  problems  is  their  potential  use  of 
advanced  computer  architecture.  The  question  is  already  important  for  small  problems  (see  [22]), 
but  is  even  more  so  for  large  ones,  because  the  amount  of  calculation  that  is  purely  internal  to 
the  algorithm  (therefore  excluding  problem  functions  evaluation)  rises  significantly  and  may  well 
become  the  dominant  computational  cost. 

The  exploitation  of  partial  separability  and  group  partial  separability  on  parallel  computers 
is quite straightforward and efficient (see [17] and [16]). The overall idea is very simple. We note
that the computational tasks in an algorithm using partial separability are of three types:

function and derivative evaluations: Because of the partially separable structure, all the
element functions are independent, and their evaluation can be spread over the available
processors in a purely asynchronous manner, an ideal situation in parallel computing.

internal linear algebra: This is one of the areas where the use of parallel computers has been
most studied and for which efficient algorithms are now available. At the $k$-th iteration of
a typical Newton-type method, partially separable problems give rise to linear systems of
the form

$$ \nabla^2 f(x_k) \, s_k = -\nabla f(x_k). \tag{24} $$

From the linear algebra point of view, the structure of such systems is extremely similar
to that of finite element systems, as the coefficient matrix is the sum of element Hessians
which must ultimately be assembled. Efficient algorithms for this class of problems are well
studied and developed (see [1] and [12] for instance). Both iterative methods (e.g. conjugate
gradient variants) and direct algorithms (e.g. multifrontal techniques) have been applied to
large scale partially separable optimization problems in this context (see [17] and [6]).

internal element handling and updating: These are the algorithm’s inner calculations that
handle the element functions (including the Hessian approximation update for quasi-Newton
methods). Again, these can be shared by the available processors because of the independence
of these functions.
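
Iterative methods only need products of the Hessian with a vector, and these can be formed element by element. The following Python sketch shows this with a plain conjugate gradient loop applied to a system of the form (24). The element matrices, expressed here directly in their elemental variables, and the function names are made up for illustration, and no preconditioning is attempted.

```python
def matvec(elements, v):
    """Product of the sum of element matrices with v, one element at a time."""
    out = [0.0] * len(v)
    for idx, H in elements:
        local = [v[k] for k in idx]
        for a, p in enumerate(idx):
            out[p] += sum(H[a][b] * local[b] for b in range(len(idx)))
    return out

def conjugate_gradient(elements, rhs, tol=1e-10, max_iter=200):
    """Solve (sum of element matrices) x = rhs for a positive definite sum."""
    x = [0.0] * len(rhs)
    r = rhs[:]
    p = r[:]
    rr = sum(t * t for t in r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        Ap = matvec(elements, p)
        alpha = rr / sum(a * b for a, b in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = sum(t * t for t in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

n = 8
# Each element couples two consecutive variables through a 2x2 matrix.
elements = [([k, k + 1], [[2.0, -1.0], [-1.0, 2.0]]) for k in range(n - 1)]
gradient = [1.0] * n
step = conjugate_gradient(elements, [-g for g in gradient])
residual = [a + g for a, g in zip(matvec(elements, step), gradient)]
print(max(abs(t) for t in residual))
```

The loop over elements inside `matvec` has independent iterations apart from the final accumulation, which is what makes this formulation attractive on parallel machines.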

Of  course,  if  one  restricts  one’s  attention  to  sparsity  of  the  Hessian  matrix,  important  speedups 
can  still  be  achieved  in  internal  linear  algebra,  as  discussed  above,  but  parallelisation  of  the  two 
other  types  of  computational  tasks  is  then  more  difficult. 

As a conclusion, we may say that the use of partial and group partial separability helps
algorithms exploit the potential of parallel computers by providing a natural problem
decomposition, which then results in an efficient partitioning of the computational work amongst
the processors.
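
As a small illustration of this decomposition, the Python sketch below distributes the element evaluations of the minimum surface objective over a pool of workers. It assumes the element form $\frac{1}{m^2}\sqrt{1 + \frac{m^2}{2}(u^2 + v^2)}$ and uses the standard `concurrent.futures` thread pool only to show the task structure; with CPython threads no real speedup should be expected for such cheap elements.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def element_value(task):
    """Area of one element from its four corner heights."""
    m, a, b, c, d = task
    u, v = a - d, c - b
    return math.sqrt(1.0 + 0.5 * m * m * (u * u + v * v)) / (m * m)

def parallel_objective(x, workers=4):
    m = len(x) - 1
    tasks = [(m, x[i][j], x[i + 1][j], x[i][j + 1], x[i + 1][j + 1])
             for i in range(m) for j in range(m)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Elements are independent, so they may be evaluated in any order.
        return math.fsum(pool.map(element_value, tasks))

m = 8
x = [[i / m for _ in range(m + 1)] for i in range(m + 1)]
print(parallel_objective(x))
```

Each task carries only the four heights its element needs, so no worker ever requires the complete vector of variables.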

2.4  The  sources  of  (group)  partially  separable  problems 

One of the major sources of (group) partially separable problems is the discretisation of
continuous problems. Both finite difference and finite element approximations result in problems
of this type, mostly because of the “locality” of the operators involved, which only relate variables
corresponding to “neighbouring” discretisation points. An interesting collection of such problems
has recently been gathered by J. Moré in [20]. This collection features, amongst others, chemical
engineering applications, variational inequalities, biomedical modelling, boundary value problems
and elasticity analysis.

Nonlinear network problems form another important source of group partially separable
problems, ranging from urban traffic equilibria (see [13] for an excellent survey of this type of
application) to water and gas resource management [21]. The structure again results from the same
“locality” property that we mentioned for discretised problems: nonlinear variables explicitly
interact when they are close to each other in the considered network (they are typically associated
with arcs incident to a given node, or with nodes at the extremities of a given arc).

Other classes of problems that often exhibit partially and/or group partially separable
structure include

•  multiperiod  planning  models, 

•  input-output  macro-economic  models, 

• multiobjective optimization,

•  nonlinear  matrix  equations. 

It seems therefore fair to say that (group) partially separable problems actually occur in most
fields where large scale nonlinear optimization is itself relevant. This is not surprising if one
recalls Theorem 1, but it is worthwhile to note that the decomposition (14) or (19) often arises
very naturally in the problem formulation, its description therefore requiring a minimal amount
of additional effort.

3  The  LANCELOT  software  project 

The  LANCELOT  software  project  was  started  by  the  authors  more  than  two  years  ago.  The 
meaning  of  the  LANCELOT  acronym  is  explained  by  the  banner  displayed  in  Figure  2. 

LANCELOT: Large And Nonlinear Constrained Extended Lagrangian Optimization Techniques

Figure 2: The LANCELOT banner


According to this banner, LANCELOT’s purpose is to attack large problems involving nonlinear
objectives and/or constraints by using techniques based on the Lagrangian function³.

The  main  characteristics  of  the  LANCELOT  software  can  be  described  as  follows. 

Use of problem structure: As discussed in the first part of the paper, the use of problem
structure  is  the  only  reasonable  way  to  tackle  large  problems.  For  the  reasons  explained 
above,  (group)  partial  separability  seems  the  right  concept  to  invoke  for  this  purpose: 
LANCELOT  will  therefore  explicitly  handle  these  types  of  structure. 

On the other hand, this capacity to exploit structure will inevitably result in some
inefficiency when handling unstructured problems. Care will be taken to ensure that this
inefficiency is not too severe.

Efficiency  and  reliability:  LANCELOT  will  be  efficient  and  reliable.  At  the  beginning  of  a 
software  project,  such  a  statement  is  of  course  a  little  preposterous.  What  is  meant  is 
that  the  algorithm  design  and  implementation  will  systematically  use  efficient  and  reliable 
structures  and  methods. 

³ The “s” at the end of “techniques” is intentional: the present pilot version of the software already uses two
different extensions or augmentations of the Lagrangian.

An important point is that strong emphasis is put on the theoretical justification of the
algorithms and structures used by the software. Suitable convergence theory should be
available (and, in part, already is: see [3], [5], [7], [11]). Furthermore, the final implementations
and the studied algorithms should differ as little as possible. This requirement of a
well established supporting theory is not considered as a sufficient condition guaranteeing
software reliability, but as truly necessary.

This theoretical support will be completed by intensive testing on model problems, both
practical and more academic. The former are crucial because they best reflect the
situation in which the software will be applied. The latter are important too because
they sometimes introduce extreme numerical difficulties: the performance of the system
when faced with these difficulties is easier to isolate and to improve on such idealised
problems (see [4] and [6] for preliminary tests).

Of  course,  the  final  efficiency  and  reliability  achieved  is  best  judged  by  the  end-users! 

Scope:  The  domain  of  application  of  the  LANCELOT  system  will  include  a  large  part  of  smooth 
nonlinear  optimization  problems.  Although  primarily  focussed  on  large  scale  problems, 
LANCELOT  will  also  cover  small  and  medium  size  ones.  It  will  exploit  specific  types  of 
constraints,  including 

•  none  (unconstrained  problems), 

•  simple  bounds  on  the  problem’s  variables, 

•  linear  network-type  constraints, 

•  general  linear  constraints, 

•  convex  constraints,  where  the  feasible  domain  is  such  that  a  (possibly  approximate) 
projection  can  be  efficiently  computed, 

•  general  nonlinear  nonconvex  constraints. 

Extension  to  nonsmooth  problems  is  of  interest,  but  is  not  planned  at  this  stage.  The 
emphasis  will  be  on  nonlinear  problems:  LANCELOT  is  not  presently  intended  to  compete 
with  large  linear  programming  packages. 

Ease  of  use:  Inputting  problems  to  LANCELOT  will  be  reasonably  easy.  A  standard  input 
format  for  nonlinear  problems  (SDIF)  has  been  proposed  by  the  authors  to  achieve  this 
objective.  It  features  a  number  of  facilities  to  describe  problem  structure,  along  the  lines 
analysed in Section 2 of this paper. For instance, it automatically handles multi-indexed
variables as they naturally arise from discretizations of multi-dimensional problems (such as the
minimum surface example presented above). This format has been formalised in [8] and has
already  been  used  for  the  input  of  a  fairly  sizeable  set  of  large  problems.  Although  not  as 
complete  as  a  true  modelling  language,  it  nevertheless  provides  an  important  practical  help 
in specifying structured problems, as well as invaluable internal data consistency checks.

In the authors’ experience, large structured nonlinear problems arising from applications⁴
have been fully specified using the SDIF, validated and solved by the pilot version
of LANCELOT, the whole process taking less than one hour (which we consider quite
reasonable).

Adaptability: The LANCELOT system is also designed in a very modular and hierarchical way
which, in turn, makes the system readily adaptable to extensions, both algorithmic
and implementation oriented.

This organisation is also made necessary by the need to provide more than one methodology
for some of the algorithmic parts of the system: preconditioning the large linear systems
arising from Newton’s equation requires, for example, that several strategies (simple diagonal
scaling, incomplete factorisation, modified band techniques, ...) be available to the
user.

The  adaptability  of  the  LANCELOT  software  is  also  enhanced  by  the  choice  of  a  reverse 
communication  interface  for  the  system. 
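
In a reverse communication interface the optimizer never calls the user's routines; it returns to the caller whenever it needs problem information. The Python generator below is only a sketch of the idea, using steepest descent with a fixed step on a hypothetical two-variable quadratic. It does not reproduce LANCELOT's actual interface, which is not described here.

```python
def minimise(x0, step=0.1, tol=1e-8, max_iter=1000):
    """Steepest descent written as a generator: it never calls the user's
    function. It yields a point and expects the gradient there to be sent back."""
    x = list(x0)
    for _ in range(max_iter):
        grad = yield ("need_gradient", x)      # hand control back to the caller
        if max(abs(g) for g in grad) <= tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    yield ("done", x)

# Caller's side: the problem functions live entirely in user code.
def user_gradient(x):
    """Gradient of the hypothetical objective (x0 - 1)^2 + 2 (x1 + 2)^2."""
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

solver = minimise([0.0, 0.0])
request, point = next(solver)
while request == "need_gradient":
    request, point = solver.send(user_gradient(point))
print(point)
```

Because the solver only exchanges data with its caller, the same loop works whether the gradient comes from hand-written code, a black box routine or an external program.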

Portability:  Because  the  LANCELOT  system  is  designed  to  be  easily  portable,  the  programming 
language  most  commonly  used  for  scientific  applications,  Fortran  77,  has  been  chosen  for 
its  development.  Strict  conformity  with  the  standard  of  the  language  is  enforced  at  all 
levels  of  the  system.  Transfer  between  different  machines  (CRAY,  IBM,  DEC  and  SUN 
mainframes  and  workstations,  ...)  and  operating  systems  (VM/CMS,  VMS,  UNIX,  ...) 
also  takes  place  during  the  development  phases,  in  order  to  ensure  maximal  portability,  not 
only  of  the  end-product,  but  also  of  the  successive  pilot  codes. 

Following  the  arguments  of  Section  2.3,  the  code  is  also  designed  in  a  way  that  has  the 
potential  to  make  it  efficient  on  parallel  and/or  vector  computers. 

4  Conclusions 

We  have  shown  how  the  structure  of  large  complex  nonlinear  problems  can  be  analysed  using  the 
concept  of  (group)  partial  separability.  We  have  also  discussed  some  aspects  of  the  LANCELOT 
project,  whose  purpose  is  the  implementation  of  this  approach  in  a  practical  software  tool. 

Development  of  the  LANCELOT  system  is  ongoing,  both  from  the  theoretical  and  software 
viewpoints.  As  alluded  to  above,  some  layers  and  functionalities  of  the  system  are  already 
operational  and  being  tested.  Detailed  numerical  experiments  with  these  modules  will  shortly  be 
reported  elsewhere. 

The coherence of theoretical concepts with their applications to “real world” problems and
the coherence of the theoretical concepts between themselves are considered by the authors to
be of central importance. An approach to large scale nonlinear programming that has this type
of coherency has been outlined in this paper, ranging from abstract convergence theory and
structure analysis to practical software implementations. In a domain just reaching maturity,
this unified perspective is desirable.

¹The examples we have in mind here were proposed by practitioners in the fields of energy modelling and finite
element applications. They involve more than 1000 nonlinear variables and some of them have nonlinear constraints.
Abstract

This paper explores the use of adaptive polynomial preconditioning for hermitian positive
definite linear systems, Ax = b. Such preconditioners are easy to employ and well-suited
to vector and/or parallel machines. After examining the role of polynomial preconditioning
in conjugate gradient methods, we discuss the least squares and Chebyshev preconditioning
polynomials. We determine those eigenvalue distributions for which each is well-suited.
We also describe an adaptive procedure for dynamically computing the optimum Chebyshev
polynomial preconditioner. Finally, in a variety of numerical experiments on a Cray
X-MP/48 and Alliant FX/8, we demonstrate the effectiveness of adaptive polynomial
preconditioning. Our results suggest that relatively low degree (2-16) polynomials are usually
best.
1. Introduction. This paper examines polynomial preconditioning for hermitian
positive  definite  (hpd)  linear  systems  of  equations,  Ax  =  b.  Such  systems  arise  in  many  scientific 
applications.  For  example,  the  matrix  resulting  from  the  7-point  finite  difference  approximation 
to  a  three-dimensional  self-adjoint  elliptic  PDE  is  large,  sparse,  and  hpd.  The  conjugate  gradient 
(CG)  method  of  Hestenes  and  Stiefel  [19]  is  a  popular  and  effective  solution  technique  for  these 
linear  systems,  especially  when  combined  with  a  preconditioner.  The  incomplete  Cholesky 
(IC)  factorization  of  Meijerink  and  van  der  Vorst  [23]  is  often  an  effective  preconditioner  for 
CG,  but  other  choices  include  Jacobi  and  SSOR.  In  this  paper,  we  will  consider  polynomial 
preconditioning  for  conjugate  gradient  methods.  That  is,  we  will  solve 

C(A)Ax  =  C(A)b  (1.1) 

where $C(\lambda)$ is a preconditioning polynomial and $C(A)$ is the associated polynomial
preconditioner. We will assume that $C(\lambda)$ has real coefficients, in which case both $C(A)$ and $C(A)A$
are hermitian.

Polynomial preconditioning has several advantages. First, it is simple: there are only two
intrinsic operations, matrix-vector multiplication (matvec) and vector addition (saxpy). The
user need only specify the polynomial degree and initialize a few parameters; the preconditioning
may be implemented automatically. Since polynomial preconditioning requires only
matrix-vector multiplication, it is ideally suited to “matrix-free” computations [6].

Polynomial  preconditioning  is  also  versatile.  As  discussed  in  [3],  polynomial  preconditioners 
may be used in a variety of CG methods. The best-known of these is the PCG method of
Concus,  Golub,  and  O’Leary  [11].  However,  one  may  exploit  the  special  properties  of  polynomial 
preconditioners  to  devise  new  CG  methods  [4].  The  key  to  this  versatility  is  commutativity:  a 
polynomial  in  A  commutes  with  A.  In  other  words,  the  preconditioner  C  commutes  with  the 
matrix  A,  a  property  generally  not  shared  by  other  preconditioners. 

The main advantage of polynomial preconditioning is its suitability for vector and/or
parallel architectures. If the matvec is vectorizable, as when A has a regular sparsity structure,
polynomial preconditioning is effective on vector machines; see [3, 12, 21, 22]. In contrast,
incomplete factorizations are difficult to vectorize, especially for the nonexpert. It is also
possible to chain the matvecs (implicit in the preconditioning), thereby enhancing data locality
and reducing memory traffic; see [9, 10, 27, 28]. Polynomial preconditioning is also effective on
parallel  machines,  especially  those  on  which  inner  products  are  a  bottleneck.  This  is  so  because 
polynomial  preconditioned  CG  methods  converge  in  fewer  steps  than  unpreconditioned  CG,  and 
thus  compute  fewer  inner  products,  albeit  at  the  cost  of  several  matvecs  per  step  instead  of  one. 
However,  in  many  applications  the  matvec  is  parallelizable,  and  so  we  can  expect  an  overall 
reduction  in  CPU  time  on  some  architectures  by  substituting  matvecs  for  inner  products.  The 
effectiveness  of  polynomial  preconditioning  has  been  demonstrated  on  an  Alliant  FX/8  [24]  and 
on  a  Connection  Machine  [7]. 

A  common  complaint  about  polynomial  preconditioning  is  that,  unlike  incomplete  Cholesky, 
it  is  only  marginally  better  than  unpreconditioned  CG,  which  we  will  call  CGHS.  This  criticism 
is  misguided  because  it  is  based  on  the  number  of  iterations  required  for  convergence,  rather 
than  the  CPU  time.  Although  ICCG  may  take  fewer  iterations  than  PPCG,  the  latter  often 
takes  less  time  [12,  22].  Moreover,  even  when  incomplete  Cholesky  is  more  effective,  it  can  be 
further accelerated by using a polynomial preconditioner. Specifically, one applies CG to

$C(M^{-1}A)M^{-1}Ax = C(M^{-1}A)M^{-1}b$  (1.2)

where $M$ is the matrix representation of the incomplete factorization. Notice that if $M$ and $A$
are hermitian, then so is the preconditioner $C(M^{-1}A)M^{-1}$. Several CG methods are applicable
under these conditions [1, 2, 4].

We  emphasize  that  polynomial  preconditioning  is  most  effective  on  parallel  machines,  where 
it  typically  outperforms  incomplete  Cholesky.  Moreover,  since  polynomial  preconditioning  can 
be  implemented  automatically,  it  is  as  easy  to  use  as  CGHS.  Thus,  any  improvement  over  CGHS 
is  obtained  essentially  for  free. 

1.1.  Outline  of  Paper.  In  the  next  section  we  review  preconditioned  CG  methods. 
After  presenting  two  implementations  of  a  CG  method,  we  discuss  the  various  ways  in  which 
a  polynomial  preconditioner  can  be  used.  In  §  3  we  examine  polynomial  preconditioning.  In 
particular,  we  discuss  the  least  squares  and  Chebyshev  preconditioning  polynomials,  study  them 
in the context of CG methods, and show that the latter minimizes a bound on the condition
number of the preconditioned matrix. We compare the two polynomials in § 4. In a variety
of numerical experiments we determine those eigenvalue distributions for which each is
well-suited. In § 5 we describe an adaptive procedure for dynamically computing $\lambda_c$ and $\lambda_d$, the
smallest and largest eigenvalues of our hpd matrix A. These extreme eigenvalues are needed
to determine the best Chebyshev polynomial preconditioner for many eigenvalue distributions.
We  also  present  numerical  results  demonstrating  the  accuracy  and  efficiency  of  the  adaptive 
procedure.  Finally,  in  §  6,  we  summarize  some  numerical  experiments  which  demonstrate  the 
effectiveness  of  polynomial  preconditioning  on  a  variety  of  test  problems.  Our  results  suggest 
that  relatively  low  degree  (2-16)  polynomials  are  usually  best. 

2.  Preconditioned  CG  Methods.  In  this  section  we  examine  the  use  of 
polynomial  preconditioners  in  CG  methods.  To  do  this  it  is  useful  to  first  characterize  CG 
methods. The discussion below is culled from [3] and [4].

In [4] it is shown that any CG method is characterized by three matrices: an hpd inner
product matrix $B$, a left preconditioning matrix $C$, and the original system matrix $A$. The
resulting CG method, CG($B$,$C$,$A$), minimizes $\|e_i\|_B = \langle Be_i, e_i\rangle^{1/2}$ over $V_i(CA, Cr_0)$, where

$V_i(CA, Cr_0) = \mathrm{span}\{Cr_0, (CA)Cr_0, (CA)^2Cr_0, \ldots, (CA)^{i-1}Cr_0\}$  (2.1)

is a Krylov subspace of dimension at most $i$, $e_i$ is the error in the current iterate, $r_0$ is the
initial residual, and $\langle\cdot,\cdot\rangle$ denotes the usual Euclidean inner product. By specifying the inner
product matrix $B$, we obtain a particular CG method. For example, when $A$ is hpd, one may
take $B = A$ and $C = I$, which yields CGHS, the original method of Hestenes and Stiefel. If one
takes $B = A^2$ and $C = I$, the conjugate residual (CR) method results.

The most robust implementation of a CG method is the so-called Odir algorithm [30]:

$p_0 = Cr_0$  (2.2)

$\alpha_i = \frac{\langle Be_i, p_i\rangle}{\langle Bp_i, p_i\rangle}$  (2.3)

$x_{i+1} = x_i + \alpha_i p_i$  (2.4)

$r_{i+1} = r_i - \alpha_i Ap_i$  (2.5)

$\gamma_i = \frac{\langle BCAp_i, p_i\rangle}{\langle Bp_i, p_i\rangle}$  (2.6)

$\sigma_i = \frac{\langle Bp_i, p_i\rangle}{\langle Bp_{i-1}, p_{i-1}\rangle}$  (2.7)

$p_{i+1} = CAp_i - \gamma_i p_i - \sigma_i p_{i-1}$  (2.8)

where $x_i$ is the current iterate, $r_i = b - Ax_i$ is its residual, and $p_i$ is the current direction vector.
This algorithm converges to the solution of $Ax = b$ whenever $BCA$ is hermitian. (For necessary
and sufficient conditions, see [4, 13].) Since the error $e_i$ is unknown, $B$ must be chosen so
that $\alpha_i$ is computable. For example, $B = A$ and $B = A^2$ yield computable CG methods. One
can also express $\alpha_i$ in terms of $C$, which allows greater flexibility in designing computable CG
methods. Specifically, a CG method is computable whenever $C^{*}Be_i$ is computable [4].

When  BCA  is  hpd,  the  cheaper  and  more  familiar  Omin  algorithm  [30]  will  converge: 


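$s_0 = Cr_0$  (2.9)

$p_0 = s_0$  (2.10)

$\alpha_i = \frac{\langle Be_i, p_i\rangle}{\langle Bp_i, p_i\rangle}$  (2.11)

$x_{i+1} = x_i + \alpha_i p_i$  (2.12)

$r_{i+1} = r_i - \alpha_i Ap_i$  (2.13)

$\beta_i = \frac{\langle Be_{i+1}, s_{i+1}\rangle}{\langle Be_i, s_i\rangle}$  (2.14)

$s_{i+1} = Cr_{i+1}$  (2.15)

$p_{i+1} = s_{i+1} + \beta_i p_i$  (2.16)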

Whereas Odir uses a 3-term recursion involving $CAp_i$ to generate the new direction vector $p_{i+1}$,
Omin uses a 2-term recursion involving the preconditioned residual, $s_{i+1}$. Unfortunately, Omin
may “stall” when $BCA$ is indefinite, in which case the more expensive Odir, or an Odir/Omin
hybrid, algorithm should be used [4, 8].
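As an illustration of the Omin recursion, the following is a minimal Python sketch for the special case B = A (the PCG setting), in which Be_i equals the residual r_i and every inner product is computable. It assumes a real symmetric positive definite matrix stored as a dense NumPy array; the function name `omin_pcg`, the callback `apply_C`, and the residual-based stopping test are illustrative choices, not taken from the paper.

```python
import numpy as np

def omin_pcg(A, b, apply_C, x0=None, tol=1e-10, max_iter=500):
    """Omin recursion with B = A, so that B e_i = r_i (assumes real SPD A).

    apply_C(r) must return C r for a symmetric positive definite C.
    Returns the approximate solution and the number of iterations used.
    """
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float)
    r = b - A @ x
    s = apply_C(r)                    # s_0 = C r_0
    p = s.copy()                      # p_0 = s_0
    rs = r @ s                        # <B e_0, s_0> with B = A
    b_norm = np.linalg.norm(b)
    for i in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, i
        Ap = A @ p
        alpha = (r @ p) / (p @ Ap)    # <B e_i, p_i> / <B p_i, p_i>
        x = x + alpha * p
        r = r - alpha * Ap
        s = apply_C(r)                # s_{i+1} = C r_{i+1}, needed before beta
        rs_new = r @ s
        beta = rs_new / rs            # <B e_{i+1}, s_{i+1}> / <B e_i, s_i>
        p = s + beta * p
        rs = rs_new
    return x, max_iter
```

Passing `apply_C = lambda r: r` gives unpreconditioned CGHS. A simple polynomial choice is `lambda r: omega * (2 * r - omega * (A @ r))` with `omega = 1 / np.linalg.norm(A, np.inf)`, which corresponds to C(λ) = ω(2 − ωλ) and is positive on the spectrum because the infinity norm bounds the largest eigenvalue.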

When $A$ is hpd and $C = C(A)$, there are several choices for the inner product matrix $B$.
A few of the resulting CG methods are listed in Table 2.1. Notice that $BCA$ is hermitian in
all cases, and so Odir converges. The Odir restrictions in Table 2.1 are sufficient to ensure that
$B$ is hpd. The Omin restrictions are sufficient to guarantee that both $B$ and $BCA$ are hpd.
The first method is PCG. Like CGHS, it minimizes the $A$-norm of the error, but does so over
a preconditioned Krylov subspace. Although the matrix $A$ must be hpd to define a norm, the
preconditioner $C(A)$ only needs to be hermitian for Odir. If $C(A)$ is hpd, one may use the
more efficient Omin algorithm. The method GCGHS, which is CGHS on $CA$, minimizes in the
$B = C(A)A$ norm, and so $C(A)$ must be chosen so that the preconditioned matrix is hpd. As we
will see, this is possible. The advantage of GCGHS is this: if $C(A)$ is a good preconditioner, then
$C(A)A \approx I$, and so the method more nearly minimizes the Euclidean norm of the error. The
next method, PCR, requires that $C(A)$ be hpd, in which case $B = AC(A)A$ is hpd because $A$ is hpd.
The last two methods, PPCR and GCR, employ $B = A^2$ and $B = (C(A)A)^2$, respectively. The
Odir algorithm will converge for either method; Omin is applicable if $C(A)A$ is hpd. Note that
PPCR is possible because $CA = AC$ (which implies that $BCA$ is hermitian), an advantage of
$C$ being a polynomial in $A$. The last method, GCR, is simply CR applied to the preconditioned
matrix, $CA$. We remark that each method except PCG is applicable to hermitian indefinite $A$;
see [5].

Method   B            CA       Odir Restrictions   Omin Restrictions
PCG      A            C(A)A    A hpd               A hpd, C(A) hpd
GCGHS    C(A)A        C(A)A    C(A)A hpd           C(A)A hpd
PCR      AC(A)A       C(A)A    C(A) hpd            A hpd, C(A) hpd
PPCR     A^2          C(A)A    none                C(A)A hpd
GCR      (C(A)A)^2    C(A)A    none                C(A)A hpd

Table 2.1: Polynomial Preconditioned CG Methods for HPD A

Finally, we note that the spectral and $B$ condition numbers of $CA$ are identical for each of
the methods in Table 2.1. That is, $\kappa_s(CA) = \kappa_B(CA)$, where $\kappa_B(CA) = \|CA\|_B \|(CA)^{-1}\|_B$.
Thus, estimates for the extreme eigenvalues of $CA$ yield a bound on $\kappa_s(CA)$, which may be used
to implement a stopping criterion based on the true error, rather than the more usual residual
error. Eigenvalue estimates for $CA$ are easily obtained from the CG iteration parameters [4, 11].
This is also the basis for the adaptive procedure discussed in § 5.


3. Polynomial Preconditioning. In this section we examine several choices
for $C(\lambda)$. We wish to choose $C$ to accelerate convergence of the CG iteration. One usually
chooses $C$ to approximate $A^{-1}$ in some sense, for example, by choosing $C(\lambda) \approx \lambda^{-1}$. Of course,
there are several ways of doing this. As we will see, there is no single “best” polynomial; the
proper choice of $C(\lambda)$ depends on the eigenvalue distribution of $A$, which is seldom known a
priori.

A simple choice for $C(\lambda)$ is based on the Neumann series. Let $A = M - N$ and consider

$A^{-1} = (M - N)^{-1} = (I + G + G^2 + G^3 + \cdots)M^{-1}$  (3.1)

where $G = M^{-1}N$. If the spectral radius of $G$ is less than one, the series converges. We obtain
our polynomial approximation to $A^{-1}$ by truncating the Neumann series [2, 7, 12, 22]. The
advantage of this polynomial is its simplicity: there are no parameters to estimate.
Unfortunately, it may yield a poor preconditioner. If one desires a polynomial preconditioner of degree
$m - 1$, one can do much better than the Neumann series polynomial. For example, Jordan [22]
has shown that the Chebyshev polynomial (§ 3.2) is superior. Experiments also suggest that
the optimum degree for the Neumann series polynomial is two [2, 12, 22], whereas the optimum
Chebyshev or least squares polynomial degree is often higher [3, 24, 27].
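The truncated Neumann series can be applied without forming any power of G. The sketch below is one possible realization under an assumption not made in the text, namely the Jacobi splitting M = diag(A), so that G = I − M⁻¹A; the function name `neumann_apply` is hypothetical. It evaluates (I + G + ... + G^{m−1})M⁻¹v by a Horner-style recurrence.

```python
import numpy as np

def neumann_apply(A, v, m):
    """Apply the m-term truncated Neumann series to v.

    Assumes the splitting M = diag(A), N = M - A (an illustrative choice),
    so that G z = z - M^{-1} (A z).
    """
    diag = np.diag(A)
    w = v / diag                       # w = M^{-1} v
    z = w.copy()                       # one-term approximation
    for _ in range(m - 1):
        z = w + z - (A @ z) / diag     # z <- w + G z
    return z
```

Each pass through the loop costs one matvec, so the m-term approximation uses m − 1 matvecs. The series converges only when the spectral radius of G is below one, for example when A is strictly diagonally dominant.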

To obtain a better preconditioner, recall that $C(\lambda)$ should approximate $\lambda^{-1}$ in some sense.
That is, $C(\lambda)$ should be the “best” polynomial approximation to $\lambda^{-1}$ on some set $S$ containing
the spectrum of $A$, $\sigma(A)$. Since $A$ is hpd, we will take $S = [c,d]$, where $0 < c < d$. Ideally,
$c = \lambda_c$ and $d = \lambda_d$, the smallest and largest eigenvalues of $A$. We next define the “best”
polynomial to be that one which solves the following approximation problem:

$\min_{C \in \pi_{m-1}} \|1 - C(\lambda)\lambda\|$  (3.2)

where $\pi_{m-1}$ is the set of polynomials of degree at most $m - 1$. All that remains is to specify
the norm.

Figure 3.1: Least squares preconditioned polynomial $p_m(\lambda)$ ($m = 5$) for $S = [0,20]$

3.1. The Least Squares Polynomial Preconditioner. Let us define the inner product

$\langle f, g\rangle = \int_c^d f(\lambda)\,g(\lambda)\,w(\lambda)\,d\lambda$  (3.3)

where $w(\lambda)$ is a positive weight function on $S = [c,d]$. It induces the following norm:

$\|f\|_w^2 = \int_c^d f(\lambda)^2\,w(\lambda)\,d\lambda.$  (3.4)

The solution to (3.2) in this norm is called the weighted least squares polynomial. The associated
preconditioned polynomial, $p_m(\lambda) = C(\lambda)\lambda$, is illustrated in Figure 3.1. (We call $p_m(\lambda)$ the
preconditioned polynomial because $p_m(A)$ is the preconditioned matrix.) Since the related
residual polynomials, $r_m = 1 - C(\lambda)\lambda$, are orthogonal with respect to the weight function
$\lambda w(\lambda)$, the least squares polynomial may be computed via a three-term recursion, which is
computationally stable and efficient. See also [21, 27].

Unlike the Chebyshev polynomial described below, the least squares polynomial is biased
in its suppression of the eigenvalues of $A$. For example, when $w \equiv 1$, the eigenvalues of larger
modulus are mapped closer to 1 than those of smaller modulus. If the eigenvalue distribution
of $A$ were known, one could choose $w$ to exploit this bias. In particular, one might consider a
Jacobi weight function,

$w(\lambda) = (d - \lambda)^{\alpha}(\lambda - c)^{\beta}, \quad \alpha, \beta > -1.$  (3.5)

By appropriately choosing $\alpha$ and $\beta$, one portion of $\sigma(A)$ could be emphasized over another.

One should choose the weight function $w$ so that $p_m(\lambda) = C(\lambda)\lambda$ is positive on $[\lambda_c, \lambda_d]$.
This guarantees that $p_m(A)$ is hpd, which makes practicable the Omin implementation of each
method in Table 2.1. (Note that PCG and PCR are applicable because $C(A)$ is hpd whenever
$p_m(A) = C(A)A$ is hpd.) If one employs a Jacobi weight function on $[c,d]$, one may show [29,
page 166] that $p_m(\lambda) > 0$ on $[c,d]$ if $(\alpha,\beta) \in W_1 = \{(\alpha,\beta) : \alpha > -1/2,\ \beta > -1/2\}$, which
includes the Legendre weight function $w \equiv 1$ ($\alpha = \beta = 0$).

Saad [27] has noted that the least squares polynomial is relatively insensitive to $c$, and so
one may take $c = 0$. Then, if a Jacobi weight function is used, the least squares polynomial is
given by a scaled and translated Jacobi polynomial corresponding to $\alpha$ and $\beta + 1$. Moreover,
if $(\alpha,\beta) \in W_2 = \{(\alpha,\beta) : -1 < \alpha < -1/2,\ \beta > -1\}$, the relative extrema of the least squares
polynomial decrease in magnitude on $S$ [29]. This property, which is in stark contrast to the
equioscillation property of the Chebyshev preconditioned polynomial (see below), may be used
to bias the preconditioner toward the large eigenvalues of $A$. This property also ensures that
$p_m(\lambda)$ is positive on $(0,d]$, which is important in many of the CG methods discussed in § 2.

Despite its bias, the least squares polynomial yields an effective preconditioner in many
cases [21, 27]. Since one may take $c = 0$, there is no need to estimate the smallest eigenvalue
of $A$. The right endpoint is usually taken to be the Gershgorin estimate for $\lambda_d$. We discuss a
more sophisticated adaptive procedure for dynamically estimating $\lambda_c$ and $\lambda_d$ in § 5.


3.2. The Chebyshev Polynomial Preconditioner. Another interesting norm is the
uniform norm:

$\|f\|_\infty = \max_{\lambda \in S} |f(\lambda)|.$  (3.6)

The solution to (3.2) in this norm is obtained from a shifted and scaled Chebyshev polynomial:

$C(\lambda)\lambda = 1 - \frac{T_m\left(\frac{d+c-2\lambda}{d-c}\right)}{T_m\left(\frac{d+c}{d-c}\right)}$  (3.7)

where $T_m(x)$ is the $m$th Chebyshev polynomial of the first kind [25]. Notice that $C(\lambda)$ is indeed
a polynomial in $\lambda$. It is attractive for several reasons. First, like the least squares polynomial,
it may be computed from a three-term recursion, which is computationally convenient. Second,
since this polynomial is explicitly known, it is much easier to devise an adaptive procedure for
dynamically computing the optimal endpoints $c$ and $d$. Finally, this polynomial is unbiased
in its suppression of those eigenvectors constituting the error. In other words, the Chebyshev
preconditioning polynomial is well-suited to those matrices whose eigenvalues are densely and
nearly uniformly distributed throughout the interval $S = [c,d]$. See § 4.

This last fact follows from the Chebyshev minimax property, which states that the
preconditioned polynomial, $p_m(\lambda) = C(\lambda)\lambda$, equioscillates about 1; see Figure 3.2. This
equioscillation property has several other implications. For example, if $\sigma(A) \subset [c,d]$, then
$\sigma(p_m(A)) \subset [1 - \epsilon_m, 1 + \epsilon_m]$, where $\epsilon_m = \|1 - p_m\|_\infty = 1/\left|T_m\left(\frac{d+c}{d-c}\right)\right|$. Since $\epsilon_m < 1$, the
preconditioned matrix, $p_m(A)$, is hpd. One may therefore apply CGHS to $p_m(A)$, yielding the
method we call GCGHS. PCG is also applicable because $C(A)$ is hpd. Note that the spectral
condition number of $p_m(A)$, $\kappa_s(p_m(A))$, satisfies

$\kappa(p_m(A)) \le \frac{1 + \epsilon_m}{1 - \epsilon_m}$  (3.8)

when $\sigma(A) \subset [c,d]$. Since $\epsilon_m$ is a monotonically decreasing function of $m$, this bound may be
made as small as desired by taking $m$ large enough. Specifically, if

$m \ge \frac{\cosh^{-1}\left(\frac{\delta+1}{\delta-1}\right)}{\cosh^{-1}\left(\frac{d+c}{d-c}\right)}$  (3.9)

then $\kappa(p_m(A)) \le \delta$ for any $\delta > 1$. This follows from the definition of $T_m(\lambda)$ for $\lambda > 1$.

Figure 3.2: Chebyshev preconditioned polynomial ($m = 5$) for $S = [1,20]$

The bound (3.8) yields an estimate of the number of CG steps required for convergence.
One needs approximately

$\frac{\ln(\epsilon/2)}{\ln(cf)}$  (3.10)

steps to reduce the error by an amount $\epsilon$ [18], where

$cf = cf(p_m(A)) = \frac{\sqrt{\kappa(p_m(A))} - 1}{\sqrt{\kappa(p_m(A))} + 1}$  (3.11)

is the CG convergence factor for the hpd matrix $p_m(A)$. If the eigenvalues of $p_m(A)$ are
uniformly distributed throughout $[1 - \epsilon_m, 1 + \epsilon_m]$, then (3.10) is fairly accurate. Since $\kappa(p_m(A)) <
\kappa(A)$ for $m > 1$, a Chebyshev polynomial preconditioned CG method will usually converge in
fewer iterations than the unpreconditioned CGHS method. Of course, each iteration is more
expensive, requiring $m$ matvecs instead of one. We remark that $\kappa(p_m(A))$ is minimized when
$c = \lambda_c$ and $d = \lambda_d$.

The Chebyshev polynomial preconditioner is also optimum in that it minimizes a bound on
$\kappa(C(A)A)$. This is a consequence of the following

Theorem 3.1. A solution to

$\min_{C \in \pi_{m-1}} \frac{\max_{\lambda \in S} |C(\lambda)\lambda|}{\min_{\lambda \in S} |C(\lambda)\lambda|}$  (3.12)

is given by the Chebyshev preconditioning polynomial.


Proof: First observe that (3.12) does not possess a unique solution. In particular, if $C$ solves
(3.12), then so does $\gamma C$, where $\gamma$ is any nonzero constant. We may assume $C(\lambda)\lambda > 0$ for $\lambda \in S$
without loss of generality. (If $C(\lambda)\lambda = 0$ for some $\lambda \in S$, then (3.12) is unbounded.) Thus, we
may restrict ourselves to those polynomials $C(\lambda)$ for which

$1 - \min_{\lambda \in S}(C(\lambda)\lambda) = \max_{\lambda \in S}(C(\lambda)\lambda) - 1 = \epsilon(C) = \epsilon.$  (3.13)

The problem (3.12) is now equivalent to minimizing $(1 + \epsilon)/(1 - \epsilon)$. This is, in turn, equivalent
to solving (3.2) in the uniform norm. ∎

Remark: If $p_m$ is the Chebyshev preconditioned polynomial for $S$ and $\sigma(A) \subset S$, the ratio in
(3.12) gives a bound on the condition number of $p_m(A)$. Moreover, this bound is minimized
with respect to $S$ when $S = [\lambda_c, \lambda_d]$.

This theorem is similar to Theorem 3 in [21], but our proof is different. It shows the
equivalence of the minimax approximation problem (3.2) and the minimization problem (3.12). We
also remark that Rutishauser [26] was the first to propose Chebyshev polynomial
preconditioning for CGHS; his motive was to mitigate its rounding errors. We advocate polynomial
preconditioning because it is well-suited to vector and/or parallel architectures.
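To make the quantities of this subsection concrete, the following Python sketch evaluates them numerically. It assumes the forms ε_m = 1/T_m((d+c)/(d−c)), the bound κ ≤ (1+ε_m)/(1−ε_m), the convergence factor cf = (√κ − 1)/(√κ + 1), the step estimate ln(ε/2)/ln(cf), and the degree bound m ≥ cosh⁻¹((δ+1)/(δ−1))/cosh⁻¹((d+c)/(d−c)); the function names are illustrative only.

```python
import math

def chebyshev_bounds(c, d, m):
    """Return (eps_m, bound on kappa(p_m(A)), CG convergence factor)."""
    x = (d + c) / (d - c)
    # For x > 1, T_m(x) = cosh(m * arccosh(x)).
    eps_m = 1.0 / math.cosh(m * math.acosh(x))
    kappa = (1.0 + eps_m) / (1.0 - eps_m)
    cf = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    return eps_m, kappa, cf

def min_degree(c, d, delta):
    """Smallest integer degree m whose kappa bound does not exceed delta > 1."""
    ratio = math.acosh((delta + 1.0) / (delta - 1.0)) / math.acosh((d + c) / (d - c))
    return math.ceil(ratio)

def cg_steps(cf, eps):
    """Approximate number of CG steps needed to reduce the error by eps."""
    return math.log(eps / 2.0) / math.log(cf)
```

For example, with `c, d = 1.0, 20.0` one can compare `cg_steps(chebyshev_bounds(c, d, 5)[2], 1e-6)` with the unpreconditioned estimate obtained from κ(A) ≤ d/c. The preconditioned estimate is smaller in steps, but each step then costs m matvecs, which is the trade-off discussed in the text.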


3.3. Implementation. To implement least squares or Chebyshev polynomial
preconditioning, one neither explicitly forms the powers of $A$ nor determines the coefficients of $C(\lambda)$.
Instead, one executes $m$ steps of a nonstationary 2-step iteration. (In the case of Chebyshev
polynomial preconditioning, one uses the Chebyshev iteration [18].) Specifically, one applies
the 2-step iteration to the linear system $Aw = v$ with $w_0 = 0$, where $v$ is the vector to be
preconditioned, usually the residual. One may show that $w_m = C(A)v$. Note that we need only
$m - 1$ matrix-vector multiplications because the final residual need not be computed. We also
remark that the three-term recursion underlying the least squares and Chebyshev polynomials
ensures the stable evaluation of the preconditioning polynomial $C(\lambda)$. See also [21, 27].
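The procedure just described can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it uses one standard formulation of the Chebyshev iteration (with θ = (d+c)/2 and δ = (d−c)/2), assumes real data, and the names `chebyshev_apply` and `chebyshev_C` are hypothetical. The helper `chebyshev_C` evaluates C(λ) pointwise from the shifted and scaled Chebyshev polynomial, assuming the form C(λ)λ = 1 − T_m((d+c−2λ)/(d−c))/T_m((d+c)/(d−c)), and exists only so the result can be checked.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def chebyshev_apply(matvec, v, c, d, m):
    """Return w_m = C(A) v via m steps of the Chebyshev iteration on A w = v, w_0 = 0.

    matvec(x) must return A x; [c, d] should contain the spectrum of A; m >= 1.
    """
    theta = 0.5 * (d + c)
    delta = 0.5 * (d - c)
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    r = v.copy()                       # residual of A w = v at w_0 = 0
    step = r / theta
    w = np.zeros_like(v)
    for k in range(m):
        w = w + step
        if k == m - 1:
            break                      # the final residual is not needed
        r = r - matvec(step)
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        step = rho_next * rho * step + (2.0 * rho_next / delta) * r
        rho = rho_next
    return w

def chebyshev_C(lam, c, d, m):
    """Evaluate the Chebyshev preconditioning polynomial C(lambda) pointwise."""
    e = np.zeros(m + 1)
    e[m] = 1.0                         # coefficient vector selecting T_m
    t = chebval((d + c - 2.0 * lam) / (d - c), e) / chebval((d + c) / (d - c), e)
    return (1.0 - t) / lam
```

For a diagonal test matrix with eigenvalues `lam`, the output of `chebyshev_apply(lambda x: lam * x, v, c, d, m)` should agree with `chebyshev_C(lam, c, d, m) * v`, and the loop performs m − 1 matvecs, matching the count stated above.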


4.  Chebyshev  versus  Least  Squares.  In  this  section  we  compare  the  least 
squares  and  Chebyshev  preconditioning  polynomials  in  a  variety  of  numerical  experiments.  We 
will  qualitatively  describe  those  matrices  for  which  the  least  squares  polynomial  yields  a  better 
preconditioner  than  the  optimal  Chebyshev  preconditioner.  Moreover,  we  will  explain  why  this 
is  so.  The  importance  of  the  stopping  criterion  will  also  be  discussed. 

Let us begin by dispelling a common misconception: the least squares polynomial is not
universally superior to the optimal Chebyshev polynomial. (The optimal Chebyshev polynomial
is the one based on $[\lambda_c, \lambda_d]$. Recall that it yields an optimum preconditioner in the sense of
Theorem 3.1.) This follows from a result of Greenbaum [17], who established a partial ordering
on preconditioners. In brief, her result implies that the least squares preconditioner cannot be
best for every initial guess, $x_0$. However, it is better in certain cases. For although the optimal
Chebyshev polynomial minimizes the condition number of $p_m(A)$, this does not alone determine
the rate of convergence of the preconditioned CG method. The eigenvalue distribution of $p_m(A)$
is also important. Because of its equioscillation property, the Chebyshev polynomial tends to
map $[c,d]$ uniformly into $[1 - \epsilon_m, 1 + \epsilon_m]$, obliterating any favorable clustering of the eigenvalues
of $A$. The unweighted least squares polynomial ($w \equiv 1$) tends to map the larger eigenvalues of $A$
most closely about 1, giving less weight to the smaller eigenvalues. This is due to the tendency
of its relative extrema to decrease in magnitude on $S$. (This property can be made strict by an
appropriate choice of weight function; see § 3.1.) Thus, if there are relatively few eigenvalues of
$A$ near $c$, these will become isolated eigenvalues of the least squares preconditioned matrix, the
majority of whose eigenvalues will be clustered about 1. On the other hand, if the eigenvalues
of $A$ are dense near $c$, there will be no such clustering of eigenvalues. The proper choice
of polynomial therefore depends on the spectrum of $A$. We will now explore this question
numerically.

In the experiments below, the test matrices are diagonal with $N = 100{,}000$ eigenvalues
between $\lambda_c = \delta$ and $\lambda_d = 1 + \delta$. The true solution is the vector having 1 in each of its components
and $x_0 = 0$. Three eigenvalue distributions are considered. In the first, $\lambda_k = \delta + 1 - 1/k$,
$k = 2, \ldots, N - 1$, and so the eigenvalues are dense near the right endpoint $d$. In the second,
$\lambda_k = \delta + 1/(N - k + 1)$, and so the eigenvalues are dense near the left endpoint $c$. In the
third, the eigenvalues are uniformly distributed. In Figures 4.1-4.4, we plot the PCG relative
error, $\log_{10}(\|e_i\|_2/\|e_0\|_2)$, against $i$ for three polynomials of degree $m = 9$. The first is a least
squares polynomial (with Legendre weight $w \equiv 1$) based on $[0, 1 + \delta]$; the second is the optimal
Chebyshev polynomial based on $[\delta, 1 + \delta]$; and the third is a Chebyshev polynomial based on
$[\zeta, 1 + \delta]$, where $\zeta$ is chosen so the related least squares and Chebyshev residual polynomials have
the same first root. By choosing $\zeta$ in such a manner, we force this LS-Chebyshev polynomial
to mimic the behavior of the least squares polynomial in $(0, \zeta)$, and so the two polynomials
behave alike.
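The three test spectra can be generated with a few lines of Python. The sketch below is illustrative and makes assumptions beyond the text: the interior index range k = 2, ..., N−1 is applied to all three distributions, the endpoints δ and 1 + δ are appended explicitly, and the uniform spacing formula is one natural choice. The names `make_spectrum` and `cheb_preconditioned` are hypothetical. The second function evaluates the Chebyshev preconditioned polynomial, assumed to have the form p_m(λ) = 1 − T_m((d+c−2λ)/(d−c))/T_m((d+c)/(d−c)), so that one can inspect where each spectrum is mapped.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def make_spectrum(kind, N, delta):
    """Eigenvalues in [delta, 1 + delta] for the three test distributions."""
    k = np.arange(2, N)                           # interior indices 2, ..., N-1
    if kind == "dense_right":
        interior = delta + 1.0 - 1.0 / k
    elif kind == "dense_left":
        interior = delta + 1.0 / (N - k + 1)
    elif kind == "uniform":
        interior = delta + (k - 1.0) / (N - 1.0)  # assumed uniform spacing
    else:
        raise ValueError(f"unknown distribution: {kind}")
    return np.concatenate(([delta], interior, [1.0 + delta]))

def cheb_preconditioned(lam, c, d, m):
    """Evaluate the Chebyshev preconditioned polynomial p_m at the points lam."""
    e = np.zeros(m + 1)
    e[m] = 1.0                                    # coefficient vector selecting T_m
    return 1.0 - chebval((d + c - 2.0 * lam) / (d - c), e) / chebval((d + c) / (d - c), e)
```

With `delta = 1e-3` and `m = 9`, applying `cheb_preconditioned` to any of the three spectra with `c = delta` and `d = 1 + delta` yields values inside [1 − ε_m, 1 + ε_m]; a histogram of those values shows how the clustering of the original eigenvalues is redistributed.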

In Figure 4.1, we have the dense-right eigenvalue distribution with $\delta = 10^{-3}$. Since the
least squares polynomial is small on the large eigenvalues of $A$, the least squares PCG method
converges much more rapidly than the optimal Chebyshev PCG method. In Figure 4.2, the
eigenvalues are dense near the left endpoint, and the optimal Chebyshev PCG method converges
faster. Similar results were observed for other values of $\delta$. In Figures 4.3-4.4, we have the
uniform eigenvalue distribution. When $\delta = 10^{-3}$, the gap between successive eigenvalues is $1/N$,
which is smaller than $\lambda_c$, and the optimal Chebyshev PCG method converges fastest. However,
if $\delta = 10^{-8}$, the gap between successive eigenvalues is larger than $\lambda_c$, and the least squares
PCG method converges faster. We have seen similar behavior in several other experiments. In
short, the optimal Chebyshev polynomial appears to be superior to the least squares polynomial
when the gap between successive eigenvalues is small relative to the size of $\lambda_c$. The optimal
Chebyshev polynomial is also superior to the least squares polynomial when the eigenvalues of
$A$ are dense near both endpoints of $S$ or throughout $S$. In the latter case, for instance, if we
fix $\delta = 10^{-3}$ and increase $N$, the optimal Chebyshev polynomial performs best for large $N$.

Figure 4.1: Dense-Right Eigenvalue Distribution; $\delta = 10^{-3}$ (horizontal axis: number of iterations)

As claimed, the LS-Chebyshev polynomial behaves like the least squares polynomial and,
depending on the eigenvalue distribution of $A$, may be superior to the optimal Chebyshev
polynomial. For example, if the eigenvalues of $A$ are sparse near $c$ and dense near $d$ (recall
Figure 4.1), the LS-Chebyshev PCG method will usually converge in fewer iterations than the
optimal Chebyshev PCG method. The explanation is similar to that for the superiority of the
least squares polynomial: The LS-Chebyshev polynomial maps those eigenvalues in $[\zeta, d]$ more
tightly about 1 than does the optimal Chebyshev polynomial. Of course, those eigenvalues
in $[c, \zeta)$ are mapped further away from one. However, there are relatively few eigenvalues in
$(c, \zeta)$; moreover, they become isolated eigenvalues of $C(A)A$. It is well-known that CG rapidly
damps the error in the direction of the corresponding eigenvectors. After doing this, it is able
to focus its effort on the dense part of the spectrum where the LS-Chebyshev polynomial does
a better job of clustering the eigenvalues about 1. Since one seldom knows how the eigenvalues
of $A$ are distributed, one must rely on an adaptive procedure to find the optimum $S$. If one
knew the eigenvalue distribution of $A$, an appropriately weighted Chebyshev or least squares
polynomial could be used to achieve faster convergence. Freund [14] has recently proposed
using the Lanczos eigenvalue estimates to obtain such a weight function, and his results are
promising. In particular, he has shown that the resulting preconditioned CG method often
converges faster than the method based on the optimal Chebyshev polynomial. Unfortunately,
there is no guarantee that the preconditioned matrix will be hpd for all $S$, and this can make
difficult or impossible the robust implementation of some adaptive CG algorithms.

Finally, we note that the choice of stopping criterion can also affect the choice of polynomial.
Since the least squares polynomial is small on the large eigenvalues of A, it is biased toward
this part of the spectrum. Those eigenvectors associated with the large eigenvalues of A are
consequently damped the most. If one bases the stopping criterion on the relative residual,
the eigenvectors corresponding to these large eigenvalues are given greater weight. Thus, this
stopping criterion is ideally suited to the least squares polynomial. The optimal Chebyshev
polynomial, on the other hand, is well suited for use in stopping criteria based on the true
error. (Although the true error is unknown, it can be bounded [4].) The difference between
these stopping criteria can be as large as κ(A).

Figure 4.2: Dense-Left Eigenvalue Distribution; δ = 10^-3

Figure 4.3: Uniform Eigenvalue Distribution; δ = 10'

Figure 4.4: Uniform Eigenvalue Distribution; δ = 10^-5 (horizontal axis: Number of Iterations)
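The gap between the two stopping criteria can be illustrated with a small sketch. The 2x2 diagonal matrix and the two error vectors below are hypothetical choices, picked so that the initial error lies along the largest eigenvalue and the later error along the smallest; they are not taken from the experiments in this paper.

```python
import numpy as np

# Hypothetical 2x2 hpd matrix with eigenvalues 1e-3 and 1, so kappa(A) = 1000.
A = np.diag([1.0e-3, 1.0])
kappa = A[1, 1] / A[0, 0]

e0 = np.array([0.0, 1.0])      # initial error along the largest eigenvalue
e = np.array([1.0e-2, 0.0])    # later error along the smallest eigenvalue

# The residual is r = A e, so the two relative measures differ by eigenvalue weights.
rel_error = np.linalg.norm(e) / np.linalg.norm(e0)
rel_residual = np.linalg.norm(A @ e) / np.linalg.norm(A @ e0)

print(rel_error, rel_residual, rel_error / rel_residual)
```

In this constructed worst case the relative error exceeds the relative residual by a factor equal to the condition number, which is the size of the discrepancy mentioned above.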

4.1. The Need for an Adaptive Procedure. Recall that the weighted least squares
and uniform norms are defined with respect to the positive interval S, which we have assumed
contains the spectrum of A. That is, we have assumed that the smallest and largest eigenvalues
of A, λ_c and λ_d, are given. Unfortunately, this is seldom true. In the case of the least squares
polynomial, one may avoid this difficulty by choosing c and d to be the Gershgorin estimates
for λ_c and λ_d. In particular, one may take c = 0. The resulting preconditioner is often effective,
but there are eigenvalue distributions for which the optimal Chebyshev polynomial is better.
Here one needs accurate estimates for the extreme eigenvalues of A. Although this might
be viewed as a reason for using the Neumann series or least squares polynomial, it is not.
As we will see, one may dynamically estimate λ_c and λ_d from the CG iteration parameters.
This is equivalent to dynamically determining the optimum polynomial preconditioner (in the
sense of Theorem 3.1). The resulting adaptive polynomial preconditioned CG algorithm works
remarkably well in practice: it quickly and accurately determines λ_c and λ_d. We describe this
idea in the next section.

5. Adaptive CG Algorithms. In this section we discuss adaptive CG algorithms.
In such an algorithm we apply a given CG method to the preconditioned linear system
C(A)Ax = C(A)b, where C(A) is the current preconditioning polynomial. Information about
the spectrum of A is extracted from the CG iteration parameters and used to obtain a better
preconditioner, Ĉ(A). If this new preconditioner is sufficiently better than C(A), the CG
method is restarted with Ĉ; otherwise, the current iteration resumes with C. In this way the
adaptive CG algorithm dynamically determines the optimum polynomial preconditioner for A.

Determining this optimum preconditioner is equivalent to determining the smallest set S =
[c, d] that contains σ(A), the spectrum of A. Ideally, S = Σ(A) = [λ_c, λ_d], the convex hull of
σ(A). This yields the optimum Chebyshev preconditioned polynomial, p_m, which minimizes the
bound on κ(p_m(A)) obtained from (3.12). However, the extreme eigenvalues of A are seldom
known a priori, and so S is only an approximation to Σ(A). The development of an adaptive
procedure for dynamically improving this approximation is the subject of this section, which is
taken from [3].

Although we will consider only the Chebyshev polynomial, a similar procedure could be used
with the least squares polynomial to extract eigenvalue estimates from the CG iteration. However,
this is unnecessary since the least squares polynomial is insensitive to its inner endpoint.
One may take c = 0 and choose d to be the Gershgorin estimate for λ_d. See [27].
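The Gershgorin estimate mentioned above can be sketched as follows. This is a minimal illustration, not the code used in [27]: the helper name `gershgorin_interval` and the small test matrix (a 1-D Laplacian scaled to unit diagonal) are assumptions made for the example.

```python
import numpy as np

def gershgorin_interval(A):
    """Return (c, d) such that every eigenvalue of the Hermitian matrix A lies in [c, d]."""
    A = np.asarray(A)
    centers = np.real(np.diag(A))
    # Radius of each Gershgorin disc: sum of off-diagonal magnitudes in the row.
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return float(np.min(centers - radii)), float(np.max(centers + radii))

# Hypothetical test matrix: 1-D Laplacian scaled to unit diagonal.
n = 50
A = np.eye(n) - 0.5 * (np.eye(n, k=1) + np.eye(n, k=-1))

_, d = gershgorin_interval(A)
c = 0.0                          # the text's choice for the inner endpoint
eigs = np.linalg.eigvalsh(A)     # exact spectrum, for comparison only
print(c, d, eigs[0], eigs[-1])
```

For this matrix the Gershgorin upper bound is 2, slightly above the largest eigenvalue, which is adequate for the least squares polynomial but too loose at the lower end to define a good Chebyshev interval.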


5.1. Description of the Adaptive Procedure. Given an interval S that approximates
Σ(A), and a Chebyshev preconditioning polynomial C(A) based on S, a CG method is applied
to C(A)Ax = C(A)b. After a prescribed number of steps, say ℓ, the adaptive procedure is
called:

(1) Compute eigenvalue estimates for p_m(A) = C(A)A.

(2) Extract eigenvalue estimates for A and update S.

(3) Determine the new preconditioning polynomial, Ĉ(A).

(4) Resume or restart the CG iteration, whichever is appropriate.


After another ℓ CG steps, the adaptive procedure is called again, and so on until convergence.

Eigenvalue estimates for p_m(A) are easily obtained from the CG iteration parameters by
exploiting the equivalence of the CG and Lanczos algorithms [4, 11, 16]. (See [15] for an
alternative.) As we will see, it is then easy to recover eigenvalue estimates for A when the
degree m of the polynomial is odd. Once we have these estimates, we can expand S and
determine the new Chebyshev preconditioning polynomial. If this polynomial is "much" better
than the current polynomial (§ 5.2), the CG iteration is restarted. By this we mean that the
current iteration is abandoned and the CG method is applied to Ĉ(A)Ax = Ĉ(A)b. The new
initial guess is the last iterate of the previous iteration or some linear combination of past
iterates.
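The CG/Lanczos equivalence can be sketched in a few lines. The example below is an illustration under stated assumptions: it runs plain CG on a hypothetical diagonal matrix (standing in for the preconditioned matrix C(A)A), assembles the Lanczos tridiagonal matrix from the CG coefficients using the standard relations (diagonal 1/α_j + β_{j-1}/α_{j-1}, off-diagonal sqrt(β_j)/α_j), and returns its eigenvalues (Ritz values). The function name is hypothetical.

```python
import numpy as np

def cg_with_ritz_values(A, b, steps):
    """Run `steps` CG iterations on A x = b (x0 = 0); return the iterate and the
    Ritz values of the Lanczos tridiagonal built from the CG coefficients."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    alphas, betas = [], []
    for _ in range(steps):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
        alphas.append(alpha)
        betas.append(beta)
    # Assemble the symmetric tridiagonal Lanczos matrix T from alpha_j, beta_j.
    T = np.zeros((steps, steps))
    for j in range(steps):
        T[j, j] = 1.0 / alphas[j]
        if j > 0:
            T[j, j] += betas[j - 1] / alphas[j - 1]
        if j < steps - 1:
            T[j, j + 1] = T[j + 1, j] = np.sqrt(betas[j]) / alphas[j]
    return x, np.linalg.eigvalsh(T)

# Hypothetical hpd test matrix with 200 eigenvalues spread over [0.01, 2].
lam = np.linspace(0.01, 2.0, 200)
A = np.diag(lam)
b = np.ones(200)
x, ritz = cg_with_ritz_values(A, b, steps=20)
print(ritz[0], ritz[-1])
```

The extreme Ritz values lie inside the convex hull of the spectrum, which is the property the adaptive procedure relies on when it maps them back to eigenvalue estimates for A.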

To elucidate, suppose we are executing a CG iteration. Let S be the current approximation
to Σ(A) and let

$$p_m(\lambda) = C(\lambda)\lambda = 1 - \frac{T_m\left(\frac{d+c-2\lambda}{d-c}\right)}{T_m\left(\frac{d+c}{d-c}\right)} \tag{5.1}$$

be the current Chebyshev preconditioned polynomial. Note that the image of S under p_m is
J_ε = [1 − ε, 1 + ε], where ε = 1/T_m((d + c)/(d − c)); recall § 3.2. Next assume the adaptive
procedure has been called, and let μ be an eigenvalue estimate for p_m(A) such that
μ ∈ Σ(p_m(A)), the convex hull of σ(p_m(A)). (This is true of the estimates we will obtain.)
The desired eigenvalue estimate for A is one of the inverse images of μ; the task is to determine
which one. It is important that the inverse image chosen lie in Σ(A). Otherwise S is improperly
and irrevocably expanded, which slows the convergence of subsequent CG iterations.

Figure 5.1: Chebyshev preconditioned polynomial (m = 5) for S = [1, 10]
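The mapping of S into a small interval about 1 can be checked numerically. The sketch below assumes the standard shifted-and-scaled Chebyshev form p_m(λ) = 1 − T_m((d + c − 2λ)/(d − c)) / T_m((d + c)/(d − c)) and uses the setting of Figure 5.1 (m = 5, S = [1, 10]); the helper names `cheb_T` and `p_m` are hypothetical.

```python
import numpy as np

def cheb_T(m, x):
    """Chebyshev polynomial T_m(x) via the three-term recurrence (any real x)."""
    x = np.asarray(x, dtype=float)
    t_prev, t = np.ones_like(x), x.copy()
    if m == 0:
        return t_prev
    for _ in range(m - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def p_m(lam, c, d, m):
    """Chebyshev preconditioned polynomial on S = [c, d] (assumed standard form)."""
    lam = np.asarray(lam, dtype=float)
    return 1.0 - cheb_T(m, (d + c - 2.0 * lam) / (d - c)) / cheb_T(m, (d + c) / (d - c))

c, d, m = 1.0, 10.0, 5
eps = 1.0 / cheb_T(m, (d + c) / (d - c))         # half-width of the image interval
values = p_m(np.linspace(c, d, 1001), c, d, m)   # p_m sampled on S
print(float(eps), values.min(), values.max())
```

On this grid the sampled values stay within [1 − ε, 1 + ε], and p_m(0) = 0 as expected of a polynomial of the form C(λ)λ.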

Suppose first that μ ∈ J_ε. Then there exists an inverse image of μ inside S. Since there is
no justification for expanding S, this estimate may be discarded. If every eigenvalue estimate
for p_m(A) is in J_ε, there is no need to update S, and the CG iteration resumes. The adaptive
procedure has yielded no new information.

Now suppose μ ∉ J_ε. Since μ ∈ Σ(p_m(A)), μ must have an inverse image in Σ(A)\S,
which means there is an eigenvalue of A outside S. If an estimate, λ, of this eigenvalue can be
recovered, S can be expanded, and a new, better preconditioner computed. When m is odd,
it is easy to extract λ from μ, as may be seen in Figure 5.1. For example, let μ = μ_1 < 1 − ε.
Since p_m(λ) is monotonically increasing for λ ∈ (0, c), there is a unique λ_1 ∈ (0, c) such that
μ_1 = p_m(λ_1). Moreover, since μ_1 ∈ Σ(p_m(A)), λ_1 must lie in Σ(A). Therefore, c should be
decreased to λ_1. Similarly, if μ = μ_2 > 1 + ε, d should be increased to λ_2, the unique inverse
image of μ_2. Note that estimates for only the smallest and largest eigenvalues of p_m(A) are
needed, for these yield estimates for the extreme eigenvalues of A.

To compute the inverse images of μ_1 and μ_2, a rootfinder could be used, but this is
unnecessary because p_m(λ) is known explicitly. For m odd and d ≠ c, one may show

$$\lambda_1 = \frac{1}{2}\left((d+c) - (d-c)\cosh\left(\frac{1}{m}\cosh^{-1}\left(\frac{1-\mu_1}{\varepsilon}\right)\right)\right) \tag{5.2}$$

and

$$\lambda_2 = \frac{1}{2}\left((d+c) + (d-c)\cosh\left(\frac{1}{m}\cosh^{-1}\left(\frac{\mu_2-1}{\varepsilon}\right)\right)\right). \tag{5.3}$$

If d ≈ c (a common choice for the initial S), then

$$\lambda_1 \approx d\left(1 - (1-\mu_1)^{1/m}\right) \quad\text{and}\quad \lambda_2 \approx d\left(1 + (\mu_2-1)^{1/m}\right). \tag{5.4}$$

Figure 5.2: Chebyshev preconditioned polynomial (m = 4) for S = [1, 10]
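The closed-form inverse images can be checked with a round trip: evaluate the preconditioned polynomial at a point outside S and recover that point from the result. The sketch below assumes the standard Chebyshev form of p_m with ε = 1/T_m((d + c)/(d − c)); for the degenerate case c = d it assumes the limiting form p_m(λ) = 1 − (1 − λ/d)^m. All function names are hypothetical.

```python
import math

def cheb_T(m, x):
    """T_m(x) by the three-term recurrence."""
    t_prev, t = 1.0, x
    if m == 0:
        return t_prev
    for _ in range(m - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def p_m(lam, c, d, m):
    """Chebyshev preconditioned polynomial on [c, d], d != c (assumed standard form)."""
    return 1.0 - cheb_T(m, (d + c - 2.0 * lam) / (d - c)) / cheb_T(m, (d + c) / (d - c))

def inverse_images(mu1, mu2, c, d, m):
    """Inverse images of mu1 < 1 - eps and mu2 > 1 + eps for m odd and d != c."""
    eps = 1.0 / cheb_T(m, (d + c) / (d - c))
    lam1 = 0.5 * ((d + c) - (d - c) * math.cosh(math.acosh((1.0 - mu1) / eps) / m))
    lam2 = 0.5 * ((d + c) + (d - c) * math.cosh(math.acosh((mu2 - 1.0) / eps) / m))
    return lam1, lam2

def inverse_images_degenerate(mu1, mu2, d, m):
    """Inverse images when c = d, where p_m(lam) reduces to 1 - (1 - lam/d)**m."""
    return d * (1.0 - (1.0 - mu1) ** (1.0 / m)), d * (1.0 + (mu2 - 1.0) ** (1.0 / m))

# Round trip for S = [1, 10], m = 5, with hypothetical eigenvalues 0.5 and 10.5 outside S.
c, d, m = 1.0, 10.0, 5
lam1, lam2 = inverse_images(p_m(0.5, c, d, m), p_m(10.5, c, d, m), c, d, m)

# Round trip for the degenerate interval S = [1, 1], m = 7.
g1, g2 = inverse_images_degenerate(1.0 - (1.0 - 0.3) ** 7, 1.0 + (1.8 - 1.0) ** 7, 1.0, 7)
print(lam1, lam2, g1, g2)
```

Both round trips recover the points that were fed in, which is the property that lets the adaptive procedure avoid a rootfinder.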

So far we have assumed that m is odd, which is important for two reasons. To see why,
consider Figure 5.2, in which m is even. As before, any eigenvalue estimates for p_m(A) in J_ε
are discarded. Since both tails of p_m(λ) are negative, there can be no estimate μ > 1 + ε, so
suppose μ < 1 − ε. There are now two inverse images, λ_1 and λ_2, at least one of which lies
in Σ(A). The question is this: which one is it? If the wrong one is chosen, the set S may be
incorrectly enlarged, and the CG method will converge more slowly than necessary. To avoid
this ambiguity, we shall always choose m odd.

Another advantage of choosing m odd is that it yields robust CG methods. By this we mean
that p_m(A) is hpd for any hpd A and for any set S. If this were not true, the CG method might
not be defined in the early iterations. For example, if m were even and there were an eigenvalue
of A greater than the largest root of C(λ)λ, p_m(A) would be indefinite, in which case GCGHS
and PCR are inappropriate, as are the Omin implementations of PCG, PPCR and GCR. One
would have to use the Odir or Odir/Omin implementation of PCG, PPCR, or GCR. When m
is odd, on the other hand, p_m(A) is hpd for any set S and the Omin implementation of each
method in Table 2.1 is applicable.
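The difference between odd and even degree can be seen by sampling the preconditioned polynomial well beyond the right endpoint of S. The sketch below assumes the standard Chebyshev form of p_m and uses S = [1, 10] with m = 5 and m = 4, as in Figures 5.1 and 5.2; the sampling range (0, 30] is an arbitrary choice for illustration.

```python
import numpy as np

def cheb_T(m, x):
    """Chebyshev polynomial T_m(x) via the three-term recurrence (any real x)."""
    x = np.asarray(x, dtype=float)
    t_prev, t = np.ones_like(x), x.copy()
    if m == 0:
        return t_prev
    for _ in range(m - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def p_m(lam, c, d, m):
    """Chebyshev preconditioned polynomial on S = [c, d] (assumed standard form)."""
    lam = np.asarray(lam, dtype=float)
    return 1.0 - cheb_T(m, (d + c - 2.0 * lam) / (d - c)) / cheb_T(m, (d + c) / (d - c))

lam = np.linspace(1.0e-3, 30.0, 3000)      # positive eigenvalues, many outside S
odd_min = p_m(lam, 1.0, 10.0, 5).min()     # m odd: stays positive
even_min = p_m(lam, 1.0, 10.0, 4).min()    # m even: negative beyond the largest root
print(odd_min, even_min)
```

For m = 4 the largest root of p_m sits at λ = c + d = 11, so any eigenvalue of A above 11 would be mapped to a negative eigenvalue of p_m(A); for m = 5 every positive λ is mapped to a positive value.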

5.2. Resume versus Restart. After eigenvalue estimates for A are computed, the set
S is expanded, and a new preconditioning polynomial, Ĉ(A), is determined. The adaptive
procedure must now decide whether the CG iteration should be resumed using the current
preconditioner or restarted using the new preconditioner. This choice is discussed next.

Let p_m(λ) be the preconditioned polynomial for S = [c, d], the current approximation to
Σ(A). Also let μ_1 < 1 − ε and μ_2 > 1 + ε be the new eigenvalue estimates for p_m(A), and let λ_1
and λ_2 be their inverse images. If the iteration is resumed using the current preconditioner,
the revised convergence factor is

$$cf_{res} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}, \tag{5.5}$$

where κ = μ_2/μ_1 is the revised estimate for κ(p_m(A)).

Next let p̂_m(λ) be the preconditioned polynomial for Ŝ = [λ_1, λ_2], the new approximation
to Σ(A). If the iteration is restarted using the new preconditioner, the convergence factor is

$$cf_{new} = cf(\hat{p}_m(A)) = \frac{\sqrt{\hat{\kappa}}-1}{\sqrt{\hat{\kappa}}+1}, \tag{5.6}$$

where

$$\hat{\kappa} = \frac{1+\hat{\varepsilon}}{1-\hat{\varepsilon}}, \qquad \hat{\varepsilon} = \frac{1}{T_m\left(\frac{\lambda_2+\lambda_1}{\lambda_2-\lambda_1}\right)}, \tag{5.7}$$

is the condition number estimate for p̂_m(A).

Although cf_new < cf_res, the difference may be too small to warrant restarting the CG
iteration. Using equation (3.10), it is possible to predict the number of CG steps the current
iteration will need to converge to within some tolerance. It is also possible to predict how many
steps the restarted iteration will need. If these two numbers differ by one, for example, the
iteration should be resumed. Otherwise, the CG iteration should be restarted.
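The resume-versus-restart test can be sketched as follows. Several pieces are assumptions rather than the paper's exact rule: the step prediction uses the classical CG bound 2·cf^k ≤ tol as a stand-in for equation (3.10), the condition number estimate for the restarted polynomial is taken to be (1 + ε̂)/(1 − ε̂) (the ratio of the endpoints of its image interval), and the threshold `min_gain` and all function names are hypothetical. The first call uses the step-5 estimates reported in Table 5.1 with the initial interval S = [1, 1], for which p_m(λ) = 1 − (1 − λ)^m.

```python
import math

def convergence_factor(kappa):
    """CG convergence factor (sqrt(kappa) - 1) / (sqrt(kappa) + 1)."""
    s = math.sqrt(kappa)
    return (s - 1.0) / (s + 1.0)

def predicted_steps(kappa, tol):
    """Smallest k with 2 * cf**k <= tol (assumed stand-in for equation (3.10))."""
    cf = convergence_factor(kappa)
    if cf == 0.0:
        return 1
    return max(1, math.ceil(math.log(tol / 2.0) / math.log(cf)))

def resume_or_restart(mu1, mu2, lam1, lam2, m, tol=1.0e-8, min_gain=1):
    """Compare predicted step counts for the current and the new preconditioner."""
    kappa_res = mu2 / mu1
    # T_m(x) = cosh(m * acosh(x)) for x >= 1.
    eps_new = 1.0 / math.cosh(m * math.acosh((lam2 + lam1) / (lam2 - lam1)))
    kappa_new = (1.0 + eps_new) / (1.0 - eps_new)
    k_res = predicted_steps(kappa_res, tol)
    k_new = predicted_steps(kappa_new, tol)
    action = "restart" if k_res - k_new > min_gain else "resume"
    return action, k_res, k_new

# Case 1: S = [1, 1], m = 7, estimates 0.024762 and 1.9870 (step 5 of Table 5.1).
m = 7
lam1, lam2 = 0.024762, 1.9870
mu1 = 1.0 - (1.0 - lam1) ** m
mu2 = 1.0 + (lam2 - 1.0) ** m
first = resume_or_restart(mu1, mu2, lam1, lam2, m)

# Case 2 (hypothetical): estimates barely outside S = [1, 10] with m = 5.
eps = 1.0 / math.cosh(5 * math.acosh(11.0 / 9.0))
second = resume_or_restart(1.0 - eps - 1.0e-6, 1.0 + eps + 1.0e-6, 0.99999, 10.00001, 5)
print(first, second)
```

Under these assumptions the first case calls for a restart, in agreement with the action listed in Table 5.1, while the second case gains nothing from a new polynomial and resumes.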

Since S is either expanded or unchanged with each call to the adaptive procedure, it is
important that the initial set, S_0, be such that S_0 ⊂ Σ(A). If the matrix A is scaled to have unit
diagonal, one may take S_0 = [1, 1]. A more general choice is S_0 = [r, r], where r = trace(A)/N
and N is the order of A. One might also consider initially using the LS-Chebyshev polynomial.
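The general choice of initial set is a one-line computation. Because trace(A)/N is the mean of the eigenvalues, it always lies in the convex hull of the spectrum. The sketch below uses a hypothetical unscaled 1-D Laplacian as the test matrix; the helper name is an assumption.

```python
import numpy as np

def initial_interval(A):
    """Return the degenerate initial set S_0 = [r, r] with r = trace(A) / N."""
    r = float(np.real(np.trace(A))) / A.shape[0]
    return r, r

# Hypothetical test matrix: unscaled 1-D Laplacian (diagonal 2, off-diagonals -1).
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

c0, d0 = initial_interval(A)
eigs = np.linalg.eigvalsh(A)     # exact spectrum, for comparison only
print(c0, d0, eigs[0], eigs[-1])
```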

It is important to note that S = Σ(A) need not give the optimum rate of convergence. In
practice, convergence may be more rapid when S is a proper subset of Σ(A). To understand
why, recall that the true rate of convergence of a CG method depends on the distribution of the
eigenvalues of A within Σ(A). For example, suppose λ_1 ≪ λ_2. Then S = [λ_2, λ_N] is a better
choice than [λ_1, λ_N]. The reason is that CG methods pick out isolated eigenvalues and rapidly
suppress the error in the direction of the corresponding eigenvectors. Unfortunately, one seldom
knows such detailed information about σ(A). Although the recent idea of Freund [14] to use
the Lanczos eigenvalue estimates to approximate the eigenvalue distribution of A is appealing,
it is unclear whether this can be done dynamically.

5.3. Performance. Having introduced the theory behind the adaptive procedure, we now
consider its performance in practice. As we will see, the adaptive procedure works remarkably
well in that it quickly and accurately determines λ_c and λ_d. Since this section is concerned only
with the performance of the adaptive procedure, the test problems are small. Results for much
larger problems are given in § 6, where we examine the performance of adaptive polynomial
preconditioned CG algorithms.

N = 2500   m = 7

step   λ_c           λ_d           action
   0   0.10000e+01   0.10000e+01   initial
   5   0.24762e-01   0.19870e+01   restart
  10   0.27832e-02   0.19962e+01   restart
  15   0.27577e-02   0.19972e+01   resume
  20   0.19262e-02   0.19981e+01   resume
  25   0.18981e-02   0.19981e+01   resume
  30   0.18968e-02   0.19981e+01   resume
true   0.18967e-02   0.19981e+01

Table 5.1: PCG Adaptive Procedure for 5-Point Laplacian


Tables 5.1 and 5.2 summarize the behavior of the adaptive procedure for two simple test
problems. The matrices have order 2500 and result from a 5-point and 9-point finite difference
approximation to the 2-dimensional Laplacian. Although PCG results are given only for a
polynomial of degree 7, these results are typical. In each table we list the estimates for λ_c
and λ_d computed by the adaptive procedure, which is called every five steps. The adaptive
algorithm is initially given c = d = 1. In the last column we report the action taken by the
adaptive procedure.
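The values in the last row of Table 5.1 (0.18967e-02 and 0.19981e+01) can be compared with the known spectrum of the model problem. The sketch below assumes a 50 x 50 grid of interior points with Dirichlet boundary conditions and symmetric scaling to unit diagonal; the text states only that the matrix has order 2500. Under those assumptions the extreme eigenvalues of the scaled 5-point Laplacian are 1 − cos(π/51) and 1 + cos(π/51).

```python
import math

n = 50                           # assumed grid: 50 x 50 interior points, N = 2500
theta = math.pi / (n + 1)

# Extreme eigenvalues of the 5-point Laplacian after scaling to unit diagonal.
lam_c = 1.0 - math.cos(theta)
lam_d = 1.0 + math.cos(theta)

print(f"{lam_c:.5e}  {lam_d:.5e}")
```

Under these assumptions the closed-form values agree with the last row of Table 5.1 to the digits printed there, which suggests that the estimates at step 30 are accurate to about four significant figures.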

Consider Table 5.1. After five steps, the adaptive procedure found new estimates for λ_c and
λ_d and decided to restart the iteration using a new preconditioning polynomial based on these
estimates. After another five steps, the adaptive procedure refined its estimates for λ_c and λ_d
and again restarted. From here on it continues to improve its estimates for λ_c and λ_d, but opts
to resume the iteration using the polynomial determined at step 10. Similar behavior is seen in
Table 5.2. We remark that λ_c and λ_d are found more quickly with higher degree polynomials.

The performance described here is typical. Although the estimates for λ_c and λ_d eventually
converge to their true values, the adaptive procedure often finds satisfactory estimates early
on  in  the  iteration.  In  other  words,  the  adaptive  procedure  is  able  to  find  a  nearly  optimum 
polynomial  preconditioner  within  a  few  calls.  This  means  that  there  is  little  overhead  associated 
with  the  adaptive  procedure.  Finally,  we  remark  that  the  resume  versus  restart  decision  is  an 
important  one:  it  can  make  a  dramatic  difference  in  the  number  of  steps  required  for  convergence 
to  the  solution  of  the  linear  system. 


N = 2500   m = 7

step   λ_c           λ_d           action
   0   0.10000e+01   0.10000e+01   initial
   5                 0.14347e+01   restart
  10                 0.15959e+01   restart
  15   0.22899e-02   0.15M8e+01    restart
  20                 0.15990e+01   resume
  25                 0.15992e+01   resume
true   0.22753e-02   0.15992e+01

Table 5.2: PCG Adaptive Procedure for 9-Point Laplacian

6. Numerical Experiments. In this section we demonstrate the effectiveness
of adaptive polynomial preconditioning on a Cray X-MP/48 and Alliant FX/8 for some large
matrices arising in hydrology. In particular, we show that polynomial preconditioned PCG
(PPCG) can converge in less CPU time than the unpreconditioned CGHS method. Although
we do not compare it with other preconditionings, we emphasize that polynomial preconditioning
can be used to further accelerate any other preconditioning, for example, an incomplete
factorization.

6.1. Description of Experiments. Our test matrices, which arise in the modeling of
groundwater flow in a heterogeneous aquifer, result from the 7-point finite difference approximation
to a three-dimensional elliptic PDE with variable coefficients. Although several parameters
determine the difficulty of the problem, we isolate just two. In the first set of experiments, run
on a Cray X-MP/48, the hydraulic conductivity field K is uncorrelated, which makes the problem
difficult. In the second set of experiments, run on an Alliant FX/8, the field is correlated.
For each machine we vary γ, the standard deviation of the ln K field. As γ increases, so does
κ(A), the condition number of A. See [24] for details.

In the tables below, m is the degree of the preconditioned polynomial, p_m(A). The first
row of each table, m = 1, corresponds to the unpreconditioned CGHS method. We next give
the number of CPU seconds required for convergence of the CG iteration, which includes the
adaptive procedure. The iteration was halted once the relative error was brought below 10^-8
on the Cray and 10^-8 on the Alliant. In the last column of each table we list the ratio of CGHS
time to PPCG time. If this ratio is greater than 1, we say that polynomial preconditioning
is effective. In all the experiments, the right-hand-side vector b was chosen so that the true
solution vector has 1 in each component, and the initial guess was the zero vector. Since the
matrix was symmetrically scaled to have unit diagonal, we set c_0 = d_0 = 1 in the adaptive
procedure. New eigenvalue estimates were computed every ten steps with a maximum of ten
calls to the adaptive procedure. Finally, we note that the results below were taken from [3] and
[24].


6.2. Discussion of Results. In Tables 6.1-6.3 we report results for a single vector
processor of a Cray X-MP/48. In the first table, the condition number of A is about 60,000,
as estimated by the adaptive procedure. Here we obtain a 15% improvement over CGHS with
a polynomial of degree 5. In the next two tables, κ(A) is 160,000 and 360,000, respectively,
corresponding to γ = 1.5 and γ = 1.75. Notice that polynomial preconditioning is more effective
here: the CGHS/PPCG ratio reaches about 1.25, that is, CGHS needs about 25% more CPU
time than PPCG. Also observe that the optimum m is increasing with κ(A).
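The ratio column can be recomputed directly from the timings. The sketch below uses the (m, iterations, seconds) entries printed in Table 6.2 and also reports the fraction of CPU time saved at the best degree, which is a slightly different quantity from the ratio minus one; the variable names are hypothetical.

```python
# (m, iterations, seconds) as printed in Table 6.2 (Cray X-MP/48, N = 103,823).
rows = [
    (1, 2315, 37.04),
    (3, 780, 31.81),
    (5, 473, 30.56),
    (7, 341, 30.22),
    (9, 268, 30.16),
    (11, 227, 31.28),
    (13, 213, 34.84),
]

cghs_seconds = rows[0][2]        # m = 1 is the unpreconditioned CGHS method

# CGHS/PPCG ratio for each degree, rounded as in the table.
ratios = {m: round(cghs_seconds / seconds, 2) for m, _, seconds in rows}

# Degree with the smallest CPU time, and the fraction of CGHS time it saves.
best_m, _, best_seconds = min(rows, key=lambda row: row[2])
time_saved = 1.0 - best_seconds / cghs_seconds

print(ratios)
print(best_m, round(100.0 * time_saved, 1))
```

A ratio of 1.23 at the best degree corresponds to roughly 19% less CPU time than CGHS.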

In Tables 6.4-6.6 we see similar qualitative results for the Alliant FX/8, which is an
8-vector-processor machine. Although the problems are much larger, they are not nearly as
ill-conditioned. (We estimate the condition numbers to be 8,400, 14,000, and 26,000.) We
once again see the best performance on the hardest problem. Moreover, notice the much larger
CGHS/PPCG ratios: the time required to solve the problem has been nearly cut in half. The
computer architecture does indeed make a difference.

N = 103,823   γ = 1.0

 m   Iterations   Seconds   CGHS/PPCG
 1      1112                   1.00
 3       386        15.77      1.14
 5                  15.65      1.15
 7                  20.59      0.87
 9       215        24.82      0.73
11       152        21.35      0.84

Table 6.1: Cray X-MP/48 CPU Times


N = 103,823   γ = 1.5

 m   Iterations   Seconds   CGHS/PPCG
 1      2315        37.04      1.00
 3       780        31.81      1.16
 5       473        30.56      1.21
 7       341        30.22      1.23
 9       268        30.16      1.23
11       227        31.28      1.18
13       213        34.84      1.06

Table 6.2: Cray X-MP/48 CPU Times


6.3. Conclusions. In this section we have demonstrated the effectiveness of polynomial
preconditioning on a Cray X-MP/48 and an Alliant FX/8. We have seen that polynomial
preconditioning is most effective when the matrix A is ill-conditioned. Moreover, as κ(A)
increases, so does the optimum degree m. In general, however, low degree (2-16) preconditioning
polynomials are usually best. In contrast, high degree (20-50) polynomials are usually best for
hermitian indefinite matrices [3, 5]. Although we have presented results for only the hydrology
problem, our conclusions are supported by a variety of other numerical experiments, including
those in [3, 7, 12, 22, 24].

We emphasize that our adaptive CG algorithms are as easy to use as CGHS, yet can reduce
the CPU time required to solve the linear system. The amount of reduction depends on the
computer architecture. We note that Holst [20] has obtained results similar to those for the
Alliant on a Cray 2. In particular, he has reported CGHS/PPCG ratios of nearly 2 to 1, which
is far better than those achieved on the X-MP. Chan et al. [7] have shown that polynomial
preconditioning is competitive with other preconditioners on the massively parallel CM-2.


N = 103,823   γ = 1.75

 m   Iterations   Seconds   CGHS/PPCG
 1                  66.76      1.00
 3                  57.30      1.17
 5       833        54.67      1.22
 7                  53.72      1.24
 9       469        53.82      1.24
11       386        53.60      1.25
13       328        53.55      1.25
15       287        53.55      1.25
17       255        53.36      1.25
19       235        55.10      1.21

Table 6.3: Cray X-MP/48 CPU Times

Table 6.5: Alliant FX/8 CPU Times


7. Summary. In this paper we have explored the use of adaptive polynomial
preconditioning for hermitian positive definite linear systems. Such preconditioners are easy to
employ and well suited to vector and/or parallel computer architectures. After reviewing
preconditioned CG methods, we showed how one could use a polynomial preconditioner in a variety
of different ways. We then discussed the least squares and Chebyshev preconditioning
polynomials, studied them in the context of CG methods, and showed that the latter minimizes
a bound on the condition number of the preconditioned matrix. We next compared the two
polynomials in a variety of numerical experiments. In particular, we sought to determine those
eigenvalue distributions for which each is well suited. The least squares polynomial is superior
for those matrices whose eigenvalues are dense near the largest eigenvalue, λ_d. In contrast, the
Chebyshev preconditioner is superior when the eigenvalues are dense throughout the interval
or when the gap between successive eigenvalues is smaller than the smallest eigenvalue, λ_c. We
next described an adaptive procedure for dynamically computing λ_c and λ_d, which are needed
to determine the optimal Chebyshev polynomial preconditioner. The accuracy and efficiency of
this adaptive procedure were also demonstrated. Finally, in the previous section, we presented
some numerical results that demonstrate the effectiveness of adaptive polynomial preconditioning
for some large matrices arising in hydrology. Our results suggest that relatively low
degree (2-16) polynomials are usually best. Moreover, the optimum degree m of the polynomial
tends to increase with the condition number of A, as does the effectiveness of polynomial
preconditioning.


N = 410,625   γ = 2.3

 m   Iterations   Seconds   CGHS/PPCG
 1       839      1584.50      1.00
 3       308      1108.60      1.43
 5       205      1057.20      1.50
 7       134       913.34      1.73
 9       103       869.94      1.82
11        88       889.15      1.78

Table 6.6: Alliant FX/8 CPU Times